TL;DR
Probability distributions assign likelihoods to outcomes. For AI engineering, understanding softmax, entropy, cross-entropy, and KL divergence is essential for working with model outputs, loss functions, and temperature scaling.
Visual Overview
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Raw model output (logits):                               │
│    [2.1, 0.5, -0.3]        ← Just numbers, not probs      │
│                                                           │
│  After softmax:                                           │
│    [0.77, 0.16, 0.07]      ← Probability distribution     │
│                                                           │
│  Properties:                                              │
│    • Each value in [0, 1]                                 │
│    • Sum = 1.0 (certainty is distributed)                 │
│                                                           │
└───────────────────────────────────────────────────────────┘
The Softmax Function
Converts raw scores into a probability distribution:
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  softmax(x_i) = exp(x_i) / SUM(exp(x_j))                  │
│                                                           │
│  Why exp()?                                               │
│    • Makes all values positive                            │
│    • Preserves relative ordering                          │
│    • Amplifies differences                                │
│                                                           │
└───────────────────────────────────────────────────────────┘
When a model outputs [0.77, 0.16, 0.07], it’s saying: “I’m 77% confident this is a cat, 16% dog, 7% bird.”
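The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `softmax` and the max-subtraction step are choices made here (subtracting the largest logit leaves the result unchanged but keeps `exp()` from overflowing).

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result,
    # but it keeps exp() from overflowing on large logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, 0.5, -0.3])
print([round(p, 2) for p in probs])  # rounded probabilities
print(sum(probs))                    # should be 1.0 up to float error
```

Note that the ordering of the logits is preserved: the largest logit always gets the largest probability.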
Expected Value
The expected value is the weighted average of outcomes.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  E[X] = SUM(outcome × probability)                        │
│                                                           │
│  Example: Rolling a fair die                              │
│    E[X] = (1×1/6) + (2×1/6) + ... + (6×1/6) = 3.5         │
│                                                           │
│  For model outputs:                                       │
│    If rewards = [10, 5, 1] and P = [0.77, 0.16, 0.07]     │
│    E[reward] = (10×0.77) + (5×0.16) + (1×0.07) = 8.57     │
│                                                           │
└───────────────────────────────────────────────────────────┘
Why it matters: A training loss is an expected value. Training minimizes the average loss over the dataset, which serves as an estimate of the expected loss on the underlying data distribution.
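As a small sketch of the same idea in code: the helper name `expected_value` and the example reward and loss values below are illustrative assumptions, chosen only so the arithmetic is easy to follow.

```python
def expected_value(outcomes, probs):
    """Weighted average of outcomes; probs must sum to 1."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(o * p for o, p in zip(outcomes, probs))

# Fair six-sided die: every face has probability 1/6.
die = expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6)

# Illustrative rewards under an illustrative distribution.
reward = expected_value([10, 5, 1], [0.5, 0.3, 0.2])

# An average loss over a dataset is an expected value
# where every example carries weight 1/N.
losses = [0.2, 1.4, 0.6, 0.2]
mean_loss = expected_value(losses, [1 / len(losses)] * len(losses))

print(die, reward, mean_loss)
```

The last example is the link to training: averaging per-example losses is the same computation with uniform weights.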
Entropy: Average Surprise
Entropy measures uncertainty in a distribution. High entropy = uncertain. Low entropy = confident.
ENTROPY INTUITION
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  "Surprise" of an event = -log(P)                         │
│  (log = natural log, so units are nats)                   │
│                                                           │
│  Low probability  → high surprise   P=0.01 → 4.6          │
│  High probability → low surprise    P=0.99 → 0.01         │
│                                                           │
│  Entropy = Expected surprise = average across outcomes    │
│                                                           │
│  H(P) = -SUM(P(x) × log(P(x)))                            │
│                                                           │
└───────────────────────────────────────────────────────────┘
ENTROPY EXAMPLES
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Uniform distribution (maximum uncertainty):              │
│    P = [0.25, 0.25, 0.25, 0.25]                           │
│    H = 1.39 nats (2 bits)                                 │
│    "Model has no idea, all options equally likely"        │
│                                                           │
│  Peaked distribution (confident):                         │
│    P = [0.97, 0.01, 0.01, 0.01]                           │
│    H = 0.17 nats (0.24 bits)                              │
│    "Model is pretty sure it's the first option"           │
│                                                           │
│  One-hot distribution (certain):                          │
│    P = [1.0, 0.0, 0.0, 0.0]                               │
│    H = 0 nats (taking 0 × log 0 = 0)                      │
│    "Model is certain"                                     │
│                                                           │
└───────────────────────────────────────────────────────────┘
Why it matters:
- The entropy of a model's output distribution indicates how confident it is
- Temperature scaling manipulates entropy (higher temp = more uniform)
- Perplexity = exp(entropy), with entropy in nats: “how many equally likely choices is the model confused between?”
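A minimal sketch of the entropy formula, using the natural log so results are in nats. The function name `entropy` and the convention of skipping zero-probability terms (treating 0 × log 0 as 0) are choices made for this example.

```python
import math

def entropy(probs):
    """Entropy in nats: the expected surprise -log(p)."""
    # Terms with p == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(-p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
one_hot = [1.0, 0.0, 0.0, 0.0]

for name, dist in [("uniform", uniform), ("peaked", peaked), ("one-hot", one_hot)]:
    h = entropy(dist)
    # exp(entropy) is the "effective number of choices".
    print(f"{name}: H = {h:.2f} nats, exp(H) = {math.exp(h):.2f}")
```

For the uniform case, exp(H) recovers exactly the number of options, which is why perplexity reads as a count of choices. Dividing by `math.log(2)` converts nats to bits.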
Cross-Entropy: Comparing Distributions
Cross-entropy measures how well a predicted distribution Q matches a true distribution P. It is the standard loss for classification.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  H(P, Q) = -SUM(P(x) × log(Q(x)))                         │
│                                                           │
│  Where:                                                   │
│    P = true distribution (ground truth)                   │
│    Q = predicted distribution (model output)              │
│                                                           │
└───────────────────────────────────────────────────────────┘
CROSS-ENTROPY AS LOSS
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  True label: "cat" → P = [1, 0, 0] (one-hot)              │
│  Model prediction: Q = [0.7, 0.2, 0.1]                    │
│                                                           │
│  H(P, Q) = -[1×log(0.7) + 0×log(0.2) + 0×log(0.1)]        │
│          = -log(0.7)                                      │
│          = 0.36                                           │
│                                                           │
│  Only the true class matters! Simplifies to:              │
│    Loss = -log(Q(correct class))                          │
│                                                           │
│  Punishes confident wrong predictions severely:           │
│    If Q = [0.01, 0.98, 0.01] for true class cat:          │
│    Loss = -log(0.01) = 4.6  ← Much higher!                │
│                                                           │
└───────────────────────────────────────────────────────────┘
Key insight: Cross-entropy penalizes low confidence in the correct answer. A model that puts 1% probability on the right answer pays a huge penalty.
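The cross-entropy calculation can be sketched as follows. The function name `cross_entropy` and the `eps` floor are assumptions made for this example, not a specific library's API; the floor only guards against log(0).

```python
import math

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(P, Q) in nats. eps keeps log() away from zero."""
    return sum(
        -p * math.log(max(q, eps))
        for p, q in zip(p_true, q_pred)
        if p > 0  # classes with zero true probability contribute nothing
    )

cat = [1, 0, 0]  # one-hot label for "cat"

mostly_right = cross_entropy(cat, [0.7, 0.2, 0.1])
confidently_wrong = cross_entropy(cat, [0.01, 0.98, 0.01])

print(f"70% on the correct class: {mostly_right:.2f}")
print(f" 1% on the correct class: {confidently_wrong:.2f}")
```

With a one-hot label, the sum collapses to the negative log of the probability assigned to the correct class, so the loss grows without bound as that probability approaches zero.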
KL Divergence: Distance Between Distributions
KL divergence measures how different two distributions are. It’s not symmetric: DKL(P || Q) generally differs from DKL(Q || P), so it is not a true distance metric.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  DKL(P || Q) = SUM(P(x) × log(P(x)/Q(x)))                 │
│                                                           │
│  Also written as:                                         │
│    DKL(P || Q) = H(P, Q) - H(P)                           │
│                = Cross-entropy - Entropy                  │
│                                                           │
│  "Extra bits needed to encode P using Q's distribution"   │
│                                                           │
└───────────────────────────────────────────────────────────┘
Where you’ll see it:
| Context | What it measures |
|---|---|
| Fine-tuning with KL penalty | How far fine-tuned model drifted from base |
| Knowledge distillation | How well student matches teacher |
| VAEs, diffusion models | Difference from prior distribution |
Practical note: A KL divergence of 0 means the distributions are identical. It is never negative, and larger values mean the distributions differ more.
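A short sketch that checks both the asymmetry and the identity DKL(P || Q) = H(P, Q) - H(P). The function names and the two example distributions are illustrative assumptions; the code also assumes Q is nonzero wherever P is nonzero, since the divergence is infinite otherwise.

```python
import math

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """DKL(P || Q) in nats. Assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # illustrative "true" distribution
q = [0.5, 0.3, 0.2]  # illustrative approximation

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(f"DKL(P || Q) = {kl_pq:.4f}")
print(f"DKL(Q || P) = {kl_qp:.4f}")  # differs: KL is not symmetric
print(f"H(P, Q) - H(P) = {cross_entropy(p, q) - entropy(p):.4f}")
```

Because H(P) is fixed when the labels are fixed, minimizing cross-entropy with respect to the model is equivalent to minimizing DKL(P || Q).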
Temperature Scaling
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  softmax(x / T)    where T = temperature                  │
│                                                           │
│  T = 1.0  → standard softmax                              │
│  T > 1.0  → softer distribution (more random sampling)    │
│  T < 1.0  → sharper distribution (more deterministic)     │
│  T → 0    → argmax (always pick highest)                  │
│                                                           │
│  For logits [2.0, 1.0, 0.1]:                              │
│  ┌─────────────────────────────────────┐                  │
│  │  T=0.5: [0.86, 0.12, 0.02]  sharp   │                  │
│  │  T=1.0: [0.66, 0.24, 0.10]  normal  │                  │
│  │  T=2.0: [0.50, 0.30, 0.19]  soft    │                  │
│  └─────────────────────────────────────┘                  │
│                                                           │
│  Higher temperature = higher entropy = more "creative"    │
│  Lower temperature = lower entropy = more "focused"       │
│                                                           │
└───────────────────────────────────────────────────────────┘
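To see the effect numerically, here is a minimal sketch that applies softmax at several temperatures and reports the entropy of each result. The function names and the logits [2.0, 1.0, 0.1] are assumptions picked for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(x / T); temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return sum(-p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]  # illustrative logits

results = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    results[t] = (probs, entropy(probs))
    rounded = [round(p, 2) for p in probs]
    print(f"T={t}: {rounded}  entropy={results[t][1]:.2f} nats")
```

The ranking of the classes never changes with temperature. Only the spread of probability mass changes, which is what the entropy column tracks.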
Numerical Stability
COMMON GOTCHA
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Problem: log(0) = negative infinity                      │
│                                                           │
│  Solution: Add small epsilon                              │
│    log(P + 1e-10)  or  log(max(P, 1e-10))                 │
│                                                           │
│  In practice: Use framework's built-in cross_entropy      │
│  It handles numerical stability for you                   │
│                                                           │
└───────────────────────────────────────────────────────────┘
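Built-in loss functions typically avoid the problem by working from logits directly with the log-sum-exp trick, rather than adding an epsilon. The sketch below shows that idea in plain Python; the function names are hypothetical and this is not any particular framework's implementation.

```python
import math

def log_sum_exp(values):
    """log(sum(exp(v))) computed without overflow."""
    m = max(values)
    # exp(v - m) is at most 1, so it cannot overflow.
    return m + math.log(sum(math.exp(v - m) for v in values))

def cross_entropy_from_logits(logits, target_index):
    """-log(softmax(logits)[target]) without forming the probabilities."""
    return log_sum_exp(logits) - logits[target_index]

# Extreme logits: a naive exp(1000.0) would raise OverflowError,
# and the naive probability of class 1 would round to 0, giving log(0).
extreme = [1000.0, 0.0, -1000.0]

loss_right = cross_entropy_from_logits(extreme, 0)
loss_wrong = cross_entropy_from_logits(extreme, 1)

print(loss_right)  # loss when class 0 is the true class
print(loss_wrong)  # loss when class 1 is the true class
```

Both losses stay finite because no probability is ever computed and then logged; the log is taken analytically.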
When This Matters
| Situation | Concept to apply |
|---|---|
| Understanding model confidence | Softmax outputs as probabilities |
| Tuning temperature for generation | Higher temp = higher entropy = more random |
| Understanding perplexity scores | Perplexity = exp(cross-entropy) |
| Debugging “model too confident” | Look at entropy of outputs |
| Fine-tuning with KL penalty | Constrains drift from base model |
| Understanding why cross-entropy works | It heavily penalizes confident mistakes |
Production signal