Probability Basics

TL;DR

Probability distributions assign likelihoods to outcomes. For AI engineering, understanding softmax, entropy, cross-entropy, and KL divergence is essential for working with model outputs, loss functions, and temperature scaling.

Visual Overview

Model Output to Probabilities

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Raw model output (logits):                              │
│     [2.1, 0.5, -0.3]    ← Just numbers, not probs         │
│                                                           │
│   After softmax:                                          │
│     [0.77, 0.16, 0.07]  ← Probability distribution        │
│                                                           │
│   Properties:                                             │
│     • Each value in [0, 1]                                │
│     • Sum = 1.0 (certainty is distributed)                │
│                                                           │
└───────────────────────────────────────────────────────────┘

The Softmax Function

Softmax converts raw scores (logits) into a probability distribution:

Softmax Formula

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   softmax(x_i) = exp(x_i) / SUM(exp(x_j))                 │
│                                                           │
│   Why exp()?                                              │
│     • Makes all values positive                           │
│     • Preserves relative ordering                         │
│     • Amplifies differences                               │
│                                                           │
└───────────────────────────────────────────────────────────┘

When a model outputs [0.77, 0.16, 0.07], it’s saying: “I’m 77% confident this is a cat, 16% dog, 7% bird.”
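The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax` and the max-subtraction step (a common trick to avoid overflow in exp()) are choices made here, and the input is the logits vector [2.1, 0.5, -0.3] from the overview diagram.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result (it cancels in the
    # ratio) but keeps exp() from overflowing on large logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, 0.5, -0.3])
print([round(p, 2) for p in probs])  # [0.77, 0.16, 0.07]
print(round(sum(probs), 6))          # 1.0
```

The largest logit gets the largest probability, and the gaps between outputs are wider than a simple linear rescaling of the logits would give.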

Expected Value

The expected value is the probability-weighted average of outcomes.

Expected Value

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   E[X] = SUM(outcome × probability)                       │
│                                                           │
│   Example: Rolling a fair die                             │
│     E[X] = (1×1/6) + (2×1/6) + ... + (6×1/6) = 3.5        │
│                                                           │
│   For model outputs:                                      │
│     If rewards = [10, 5, 1] and P = [0.77, 0.16, 0.07]    │
│     E[reward] = (10×0.77) + (5×0.16) + (1×0.07) = 8.57    │
│                                                           │
└───────────────────────────────────────────────────────────┘

Why it matters: A training loss is an expected value. It averages the per-example loss over the dataset, and training minimizes that average.
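As a small sketch of both ideas, the snippet below computes the fair-die expectation and then treats a mean loss as an expected value under a uniform distribution over examples. The helper name `expected_value` and the three loss values are made up for illustration.

```python
def expected_value(outcomes, probs):
    """Probability-weighted average: SUM(outcome * probability)."""
    return sum(o * p for o, p in zip(outcomes, probs))

# Fair six-sided die: every face has probability 1/6.
die = expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
print(round(die, 2))  # 3.5

# Hypothetical per-example losses. Averaging them is the same as taking
# the expected loss when each example is equally likely.
losses = [0.36, 4.61, 0.11]
mean_loss = expected_value(losses, [1 / len(losses)] * len(losses))
print(round(mean_loss, 2))  # 1.69
```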

Entropy: Average Surprise

Entropy measures uncertainty in a distribution. High entropy = uncertain. Low entropy = confident.

Entropy

ENTROPY INTUITION
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   "Surprise" of an event = -log(P)   (natural log, nats)  │
│                                                           │
│   Low probability  → high surprise   P=0.01 → 4.6         │
│   High probability → low surprise    P=0.99 → 0.01        │
│                                                           │
│   Entropy = Expected surprise = average across outcomes   │
│                                                           │
│   H(P) = -SUM(P(x) × log(P(x)))                           │
│                                                           │
└───────────────────────────────────────────────────────────┘

ENTROPY EXAMPLES
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Uniform distribution (maximum uncertainty):               │
│ P = [0.25, 0.25, 0.25, 0.25]                              │
│ H = 1.39 nats                                             │
│ "Model has no idea, all options equally likely"           │
│                                                           │
│ Peaked distribution (confident):                          │
│ P = [0.97, 0.01, 0.01, 0.01]                              │
│ H = 0.17 nats                                             │
│ "Model is pretty sure it's the first option"              │
│                                                           │
│ One-hot distribution (certain):                           │
│ P = [1.0, 0.0, 0.0, 0.0]                                  │
│ H = 0 nats                                                │
│ "Model is certain"                                        │
│                                                           │
└───────────────────────────────────────────────────────────┘

Why it matters:

• The entropy of a model's output distribution indicates how confident it is
• Temperature scaling manipulates entropy (higher temperature = more uniform)
• Perplexity = exp(entropy), with entropy in nats: “how many choices is the model confused between?”
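The three example distributions can be checked with a short sketch. It assumes the natural log, so results are in nats (divide by log 2 to convert to bits), and it uses the convention that zero-probability outcomes contribute nothing. The function name `entropy` is an illustrative choice.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 are skipped (0 * log 0 := 0)."""
    return sum(-p * math.log(p) for p in probs if p > 0)

examples = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "peaked": [0.97, 0.01, 0.01, 0.01],
    "one-hot": [1.0, 0.0, 0.0, 0.0],
}

for name, dist in examples.items():
    h = entropy(dist)
    # Perplexity = exp(entropy): the effective number of equally likely choices.
    print(f"{name}: H = {h:.2f} nats, {h / math.log(2):.2f} bits, "
          f"perplexity = {math.exp(h):.2f}")
```

For the uniform case the perplexity comes out as 4, matching the intuition that the model is torn between four equally likely options.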

Cross-Entropy: Comparing Distributions

Cross-entropy measures how well a predicted distribution Q matches the true distribution P. It is the standard classification loss.

Cross-Entropy

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   H(P, Q) = -SUM(P(x) × log(Q(x)))                        │
│                                                           │
│   Where:                                                  │
│     P = true distribution (ground truth)                  │
│     Q = predicted distribution (model output)             │
│                                                           │
└───────────────────────────────────────────────────────────┘

CROSS-ENTROPY AS LOSS
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ True label: "cat" → P = [1, 0, 0] (one-hot)               │
│ Model prediction: Q = [0.7, 0.2, 0.1]                     │
│                                                           │
│ H(P, Q) = -[1×log(0.7) + 0×log(0.2) + 0×log(0.1)]         │
│ = -log(0.7)                                               │
│ = 0.36                                                    │
│                                                           │
│ Only the true class matters! Simplifies to:               │
│ Loss = -log(Q_correct)                                    │
│                                                           │
│ Punishes confident wrong predictions severely:            │
│ If Q = [0.01, 0.98, 0.01] for true class cat:             │
│ Loss = -log(0.01) = 4.6 ← Much higher!                    │
│                                                           │
└───────────────────────────────────────────────────────────┘

Key insight: Cross-entropy penalizes low confidence in the correct answer. A model that puts 1% probability on the right answer pays a huge penalty.
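The worked example above can be reproduced with a minimal sketch. The function name `cross_entropy` and the `eps` clamp (which guards against log(0)) are assumptions made for this illustration. A real framework loss usually takes logits rather than probabilities.

```python
import math

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(P, Q) = -SUM(P(x) * log(Q(x))), in nats."""
    # Terms where P(x) == 0 contribute nothing, so they are skipped.
    return sum(-p * math.log(max(q, eps))
               for p, q in zip(p_true, q_pred) if p > 0)

cat = [1, 0, 0]  # one-hot label for "cat"

good = cross_entropy(cat, [0.7, 0.2, 0.1])
bad = cross_entropy(cat, [0.01, 0.98, 0.01])
print(round(good, 2))  # 0.36
print(round(bad, 2))   # 4.61
```

With a one-hot label only the probability assigned to the true class enters the sum, which is why the loss reduces to the negative log of that single value.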

KL Divergence: Distance Between Distributions

KL divergence measures how much one distribution differs from another. It is not symmetric: DKL(P || Q) generally differs from DKL(Q || P).

KL Divergence

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   DKL(P || Q) = SUM(P(x) × log(P(x)/Q(x)))                │
│                                                           │
│   Also written as:                                        │
│     DKL(P || Q) = H(P, Q) - H(P)                          │
│                 = Cross-entropy - Entropy                 │
│                                                           │
│   "Extra bits needed to encode P using Q's distribution"  │
│                                                           │
└───────────────────────────────────────────────────────────┘

Where you’ll see it:

Context                      What it measures
───────────────────────────  ───────────────────────────────────────────
Fine-tuning with KL penalty  How far fine-tuned model drifted from base
Knowledge distillation       How well student matches teacher
VAEs, diffusion models       Difference from prior distribution

Practical note: KL divergence is never negative. A value of 0 means the distributions are identical; larger values mean they differ more.
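A short sketch can confirm both the asymmetry and the identity DKL(P || Q) = H(P, Q) - H(P). The two distributions `p` and `q` are made up for illustration, and the helper names and the `eps` clamp are assumptions of this example.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) in nats; eps guards against log(0)."""
    return sum(-pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q, eps=1e-12):
    """DKL(P || Q) = SUM(P(x) * log(P(x) / Q(x))), in nats."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over three classes.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

forward = kl_divergence(p, q)  # DKL(P || Q)
reverse = kl_divergence(q, p)  # DKL(Q || P)
print(round(forward, 3), round(reverse, 3))  # 0.085 0.092

# The decomposition into cross-entropy minus entropy holds numerically.
print(abs(forward - (cross_entropy(p, q) - entropy(p))) < 1e-12)  # True
```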

Temperature Scaling

Temperature Effect on Softmax

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   softmax(x / T) where T = temperature                    │
│                                                           │
│   T = 1.0 → standard softmax                              │
│   T > 1.0 → softer distribution (more random sampling)    │
│   T < 1.0 → sharper distribution (more deterministic)     │
│   T → 0   → argmax (always pick highest)                  │
│                                                           │
│   Example for logits [2.0, 1.0, 0.0]:                     │
│   ┌─────────────────────────────────────┐                 │
│   │ T=0.5:  [0.87, 0.12, 0.02]  sharp   │                 │
│   │ T=1.0:  [0.67, 0.24, 0.09]  normal  │                 │
│   │ T=2.0:  [0.51, 0.31, 0.19]  soft    │                 │
│   └─────────────────────────────────────┘                 │
│                                                           │
│   Higher temperature = higher entropy = more "creative"   │
│   Lower temperature = lower entropy = more "focused"      │
│                                                           │
└───────────────────────────────────────────────────────────┘
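To see the effect numerically, here is a minimal sketch that divides the logits by T before applying softmax and reports the entropy of each result. The logits [2.0, 1.0, 0.0] are hypothetical and the function names are illustrative. Values are rounded to two decimals, so a printed row may not sum to exactly 1.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(x / T); T must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (use argmax for the T -> 0 limit)")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return sum(-p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]  # hypothetical logits
results = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    results[t] = probs
    print(f"T={t}: {[round(p, 2) for p in probs]}  H={entropy(probs):.2f} nats")
# T=0.5: [0.87, 0.12, 0.02]  H=0.44 nats
# T=1.0: [0.67, 0.24, 0.09]  H=0.83 nats
# T=2.0: [0.51, 0.31, 0.19]  H=1.02 nats
```

The ranking of the classes never changes with T. Only the spread of probability mass does, which is why entropy rises with temperature.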

Numerical Stability

COMMON GOTCHA
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Problem: log(0) = negative infinity                     │
│                                                           │
│   Solution: Add small epsilon                             │
│     log(P + 1e-10)  or  log(max(P, 1e-10))                │
│                                                           │
│   In practice: Use framework's built-in cross_entropy     │
│                It handles numerical stability for you     │
│                                                           │
└───────────────────────────────────────────────────────────┘
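One common way to avoid the problem entirely is to compute log-probabilities directly from logits with the log-sum-exp trick, so that no probability is ever rounded to 0 before the log is taken. The sketch below is an assumption about how such a routine can be written, not a description of any particular framework's internals. The names `log_softmax` and `cross_entropy_from_logits` are chosen for this example.

```python
import math

def log_softmax(logits):
    """log(softmax(x)) computed without forming the probabilities first."""
    m = max(logits)
    # log(SUM(exp(x))) = m + log(SUM(exp(x - m))); every exponent is <= 0,
    # so exp() cannot overflow.
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def cross_entropy_from_logits(logits, target_index):
    """-log(Q_correct), taken straight from the logits."""
    return -log_softmax(logits)[target_index]

# Extreme logits: a naive exp(1000.0) raises OverflowError, and the
# probability of class 1 underflows to 0, which would make log() fail.
loss = cross_entropy_from_logits([1000.0, 0.0, -1000.0], target_index=1)
print(loss)  # 1000.0
```

The loss stays finite and exact here even though the corresponding probability is far too small to represent as a float.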

When This Matters

Situation                              Concept to apply
─────────────────────────────────────  ──────────────────────────────────────────
Understanding model confidence         Softmax outputs as probabilities
Tuning temperature for generation      Higher temp = higher entropy = more random
Understanding perplexity scores        Perplexity = exp(cross-entropy)
Debugging “model too confident”        Look at entropy of outputs
Fine-tuning with KL penalty            Constrains drift from base model
Understanding why cross-entropy works  It heavily penalizes confident mistakes
