Probability distributions assign likelihoods to outcomes. For AI engineering, understanding softmax, entropy, cross-entropy, and KL divergence is essential for working with model outputs, loss functions, and temperature scaling.
Visual Overview
Model Output to Probabilities
MODEL OUTPUT TO PROBABILITIES

  Raw model output (logits):
    [2.1, 0.5, -0.3]      ← just numbers, not probabilities

  After softmax:
    [0.77, 0.16, 0.07]    ← a probability distribution

  Properties:
    • Each value is in [0, 1]
    • Sum = 1.0 (certainty is distributed)
The Softmax Function
Converts raw scores (logits) into a probability distribution:

  softmax(x_i) = exp(x_i) / SUM_j(exp(x_j))
When a model outputs [0.77, 0.16, 0.07], it's saying: "I'm 77% confident this is a cat, 16% dog, 7% bird."
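As a concrete sketch, here is a minimal NumPy softmax applied to the logits above (the max-subtraction is a standard overflow guard and does not change the result):

import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability (result is unchanged)
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.1, 0.5, -0.3])
probs = softmax(logits)
print(probs)         # ≈ [0.77, 0.16, 0.07]
print(probs.sum())   # 1.0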
Expected Value
The expected value is the weighted average of outcomes.
EXPECTED VALUE

  E[X] = SUM(outcome × probability)

  Example: rolling a fair die
    E[X] = (1×1/6) + (2×1/6) + ... + (6×1/6) = 3.5

  For model outputs:
    If rewards = [10, 5, 1] and P = [0.77, 0.16, 0.07]
    E[reward] = (10×0.77) + (5×0.16) + (1×0.07) = 8.57
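A small NumPy sketch of the same two calculations (die roll and expected reward), using the numbers from the box above:

import numpy as np

# Fair die: each face 1..6 has probability 1/6
faces = np.arange(1, 7)
print(np.sum(faces / 6))              # 3.5

# Expected reward under the model's output distribution
rewards = np.array([10.0, 5.0, 1.0])
probs = np.array([0.77, 0.16, 0.07])
print(np.sum(rewards * probs))        # ≈ 8.57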
Why it matters: Loss functions compute expected loss. Training minimizes expected error across the dataset.
Entropy: Average Surprise
Entropy measures uncertainty in a distribution. High entropy = uncertain. Low entropy = confident.
ENTROPY INTUITION

  "Surprise" of an event = -log(P)

    Low probability  → high surprise    P = 0.01 → 4.6
    High probability → low surprise     P = 0.99 → 0.01

  Entropy = expected surprise, averaged across outcomes:

    H(P) = -SUM(P(x) × log(P(x)))

ENTROPY EXAMPLES (natural log, so entropy is in nats)

  Uniform distribution (maximum uncertainty):
    P = [0.25, 0.25, 0.25, 0.25]
    H = 1.39 nats
    "The model has no idea; all options are equally likely."

  Peaked distribution (confident):
    P = [0.97, 0.01, 0.01, 0.01]
    H = 0.17 nats
    "The model is fairly sure it's the first option."

  One-hot distribution (certain):
    P = [1.0, 0.0, 0.0, 0.0]
    H = 0 nats
    "The model is certain."
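A sketch of the entropy calculation in NumPy. It uses the natural log (results in nats, matching the numbers above) and skips zero-probability entries so log(0) never appears:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0                         # 0 × log(0) is treated as 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # ≈ 1.39  (uniform: maximum uncertainty)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ≈ 0.17  (peaked: confident)
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0    (one-hot: certain)

# Perplexity = exp(entropy): the "effective number of choices"
print(np.exp(entropy([0.25, 0.25, 0.25, 0.25])))   # 4.0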
Why it matters:
Entropy of model output tells you confidence
Temperature scaling manipulates entropy (higher temp = more uniform)
Perplexity = exp(entropy) — “how many choices is the model confused between?”
Cross-Entropy: Comparing Distributions
Cross-entropy measures how well a predicted distribution Q matches the true distribution P. It is the standard classification loss.
CROSS-ENTROPY

  H(P, Q) = -SUM(P(x) × log(Q(x)))

  Where:
    P = true distribution (ground truth)
    Q = predicted distribution (model output)

CROSS-ENTROPY AS LOSS

  True label: "cat"  → P = [1, 0, 0] (one-hot)
  Model prediction:    Q = [0.7, 0.2, 0.1]

  H(P, Q) = -[1×log(0.7) + 0×log(0.2) + 0×log(0.1)]
          = -log(0.7)
          = 0.36

  Only the true class matters, so the loss simplifies to:
    Loss = -log(Q_correct)

  Confident wrong predictions are punished severely:
    If Q = [0.01, 0.98, 0.01] and the true class is "cat":
    Loss = -log(0.01) = 4.6    ← much higher!
Key insight: Cross-entropy penalizes low confidence in the correct answer. A model that puts 1% probability on the right answer pays a huge penalty.
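A plain-NumPy sketch reproducing the numbers above (real training code would call a framework's built-in cross-entropy on raw logits instead):

import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-10):
    p_true = np.asarray(p_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    # Clamp predictions so log(0) cannot occur
    return -np.sum(p_true * np.log(np.maximum(q_pred, eps)))

p = [1.0, 0.0, 0.0]                          # true label: "cat" (one-hot)

print(cross_entropy(p, [0.7, 0.2, 0.1]))     # ≈ 0.36  reasonably confident and correct
print(cross_entropy(p, [0.01, 0.98, 0.01]))  # ≈ 4.61  confidently wrong: huge penalty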
KL Divergence: Distance Between Distributions
KL divergence measures how different two distributions are. It's not symmetric: D_KL(P || Q) ≠ D_KL(Q || P) in general.
KL DIVERGENCE

  D_KL(P || Q) = SUM(P(x) × log(P(x)/Q(x)))

  Also written as:
    D_KL(P || Q) = H(P, Q) - H(P)
                 = cross-entropy - entropy

  "Extra bits needed to encode P using Q's distribution"
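A NumPy sketch of KL divergence, also checking the identity D_KL(P || Q) = H(P, Q) - H(P). The two distributions here are made-up examples, not values from the text:

import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0); terms with p(x) = 0 contribute 0
    return np.sum(p * np.log(np.maximum(p, eps) / np.maximum(q, eps)))

p = np.array([0.7, 0.2, 0.1])   # hypothetical "true" distribution
q = np.array([0.5, 0.3, 0.2])   # hypothetical approximation

print(kl_divergence(p, q))                                # ≈ 0.085
print(-np.sum(p * np.log(q)) + np.sum(p * np.log(p)))     # same value: H(P, Q) - H(P)
print(kl_divergence(p, p))                                # 0.0 for identical distributions
print(kl_divergence(p, q) == kl_divergence(q, p))         # False: not symmetric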
Where you'll see it:

  Fine-tuning with a KL penalty: how far the fine-tuned model has drifted from the base model
  Knowledge distillation: how closely the student's output distribution matches the teacher's
  VAEs, diffusion models: how far a learned distribution is from the prior
Practical note: a KL divergence of 0 means the two distributions are identical; larger values mean they are further apart.
Temperature Scaling
TEMPERATURE EFFECT ON SOFTMAX

  softmax(x / T), where T = temperature

  T = 1.0  → standard softmax
  T > 1.0  → softer distribution (more random sampling)
  T < 1.0  → sharper distribution (more deterministic)
  T → 0    → argmax (always pick the highest logit)

  T = 0.5: [0.88, 0.10, 0.02]   sharp
  T = 1.0: [0.66, 0.24, 0.10]   normal
  T = 2.0: [0.49, 0.31, 0.20]   soft

  Higher temperature = higher entropy = more "creative"
  Lower temperature  = lower entropy  = more "focused"
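A sketch of temperature scaling in NumPy. The logits here are hypothetical, chosen so that T = 1.0 roughly reproduces the middle row of the box above:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide the logits by T, then apply the usual softmax
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= np.max(scaled)              # overflow guard
    exps = np.exp(scaled)
    return exps / np.sum(exps)

logits = [1.9, 0.9, 0.0]                  # hypothetical raw scores

for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 2))
# T = 0.5 → sharp (low entropy), T = 2.0 → soft (high entropy)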
Numerical Stability
COMMON GOTCHA

  Problem: log(0) = negative infinity

  Solution: add a small epsilon
    log(P + 1e-10) or log(max(P, 1e-10))

  In practice: use your framework's built-in cross_entropy.
  It handles numerical stability for you.
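A short sketch of both fixes. The first clamps probabilities before the log; the second delegates to the framework, which works directly from raw logits so log(0) never arises. The PyTorch lines are shown as comments and assume torch is installed:

import numpy as np

eps = 1e-10
p_correct = 0.0                              # a hard zero would make log() return -inf

# Fix 1: add an epsilon (or clamp) before taking the log
print(-np.log(p_correct + eps))              # ≈ 23.0, large but finite
print(-np.log(np.maximum(p_correct, eps)))   # equivalent clamping version

# Fix 2 (preferred): let the framework compute the loss from raw logits, e.g. in PyTorch:
#   import torch
#   import torch.nn.functional as F
#   logits = torch.tensor([[2.1, 0.5, -0.3]])   # raw scores, not probabilities
#   target = torch.tensor([0])                  # index of the true class
#   loss = F.cross_entropy(logits, target)      # stable log-softmax under the hood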
When This Matters

  Understanding model confidence: softmax outputs are probabilities
  Tuning temperature for generation: higher temperature = higher entropy = more random
  Understanding perplexity scores: perplexity = exp(cross-entropy)
  Debugging "model too confident": look at the entropy of the outputs
  Fine-tuning with a KL penalty: the penalty constrains drift from the base model
  Understanding why cross-entropy works: it heavily penalizes confident mistakes
Interview Notes

  💼 Interview relevance: ~70% of ML interviews
  🏭 Production impact: every LLM application
  ⚡ Performance: understanding sampling and temperature