Probability distributions assign likelihoods to outcomes. For AI engineering, understanding softmax, entropy, cross-entropy, and KL divergence is essential for working with model outputs, loss functions, and temperature scaling.
Visual Overview
Model Output to Probabilities
MODEL OUTPUT TO PROBABILITIES

  Raw model output (logits):
    [2.1, 0.5, -0.3]      ← just numbers, not probabilities

  After softmax:
    [0.77, 0.16, 0.07]    ← a probability distribution

  Properties:
    • Each value is in [0, 1]
    • Sum = 1.0 (certainty is distributed)
The Softmax Function
Converts raw scores (logits) into a probability distribution:

  softmax(x_i) = exp(x_i) / SUM_j(exp(x_j))
When a model outputs [0.77, 0.16, 0.07], it's saying: "I'm 77% confident this is a cat, 16% dog, 7% bird."
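As a concrete sketch, here is a minimal NumPy softmax applied to the logits above (the max-subtraction is a standard overflow guard and does not change the result):

import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability (result is unchanged)
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.1, 0.5, -0.3])
probs = softmax(logits)
print(probs)         # ≈ [0.77, 0.16, 0.07]
print(probs.sum())   # 1.0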
Expected Value
The expected value is the weighted average of outcomes.
EXPECTED VALUE

  E[X] = SUM(outcome × probability)

  Example: rolling a fair die
    E[X] = (1×1/6) + (2×1/6) + ... + (6×1/6) = 3.5

  For model outputs:
    If rewards = [10, 5, 1] and P = [0.77, 0.16, 0.07]
    E[reward] = (10×0.77) + (5×0.16) + (1×0.07) = 8.57
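A small NumPy sketch of the same two calculations (die roll and expected reward), using the numbers from the box above:

import numpy as np

# Fair die: each face 1..6 has probability 1/6
faces = np.arange(1, 7)
print(np.sum(faces / 6))              # 3.5

# Expected reward under the model's output distribution
rewards = np.array([10.0, 5.0, 1.0])
probs = np.array([0.77, 0.16, 0.07])
print(np.sum(rewards * probs))        # ≈ 8.57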
Why it matters: Loss functions compute expected loss. Training minimizes expected error across the dataset.
Entropy: Average Surprise
Entropy measures uncertainty in a distribution. High entropy = uncertain. Low entropy = confident.
ENTROPY INTUITION

  "Surprise" of an event = -log(P)

    Low probability  → high surprise    P = 0.01 → 4.6
    High probability → low surprise     P = 0.99 → 0.01

  Entropy = expected surprise, averaged across outcomes:

    H(P) = -SUM(P(x) × log(P(x)))

ENTROPY EXAMPLES (natural log, so entropy is in nats)

  Uniform distribution (maximum uncertainty):
    P = [0.25, 0.25, 0.25, 0.25]
    H = 1.39 nats
    "The model has no idea; all options are equally likely."

  Peaked distribution (confident):
    P = [0.97, 0.01, 0.01, 0.01]
    H = 0.17 nats
    "The model is fairly sure it's the first option."

  One-hot distribution (certain):
    P = [1.0, 0.0, 0.0, 0.0]
    H = 0 nats
    "The model is certain."
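A sketch of the entropy calculation in NumPy. It uses the natural log (results in nats, matching the numbers above) and skips zero-probability entries so log(0) never appears:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0                         # 0 × log(0) is treated as 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # ≈ 1.39  (uniform: maximum uncertainty)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ≈ 0.17  (peaked: confident)
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0    (one-hot: certain)

# Perplexity = exp(entropy): the "effective number of choices"
print(np.exp(entropy([0.25, 0.25, 0.25, 0.25])))   # 4.0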
Why it matters:
Entropy of model output tells you confidence
Temperature scaling manipulates entropy (higher temp = more uniform)
Perplexity = exp(entropy) — “how many choices is the model confused between?”
Cross-Entropy: Comparing Distributions
Cross-entropy measures how well a predicted distribution Q matches the true distribution P. It is the standard classification loss.
CROSS-ENTROPY

  H(P, Q) = -SUM(P(x) × log(Q(x)))

  Where:
    P = true distribution (ground truth)
    Q = predicted distribution (model output)

CROSS-ENTROPY AS LOSS

  True label: "cat"  → P = [1, 0, 0] (one-hot)
  Model prediction:    Q = [0.7, 0.2, 0.1]

  H(P, Q) = -[1×log(0.7) + 0×log(0.2) + 0×log(0.1)]
          = -log(0.7)
          = 0.36

  Only the true class matters, so the loss simplifies to:
    Loss = -log(Q_correct)

  Confident wrong predictions are punished severely:
    If Q = [0.01, 0.98, 0.01] and the true class is "cat":
    Loss = -log(0.01) = 4.6    ← much higher!
Key insight: Cross-entropy penalizes low confidence in the correct answer. A model that puts 1% probability on the right answer pays a huge penalty.
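A plain-NumPy sketch reproducing the numbers above (real training code would call a framework's built-in cross-entropy on raw logits instead):

import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-10):
    p_true = np.asarray(p_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    # Clamp predictions so log(0) cannot occur
    return -np.sum(p_true * np.log(np.maximum(q_pred, eps)))

p = [1.0, 0.0, 0.0]                          # true label: "cat" (one-hot)

print(cross_entropy(p, [0.7, 0.2, 0.1]))     # ≈ 0.36  reasonably confident and correct
print(cross_entropy(p, [0.01, 0.98, 0.01]))  # ≈ 4.61  confidently wrong: huge penalty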
KL Divergence: Distance Between Distributions
KL divergence measures how different two distributions are. It's not symmetric: D_KL(P || Q) ≠ D_KL(Q || P) in general.
KL DIVERGENCE

  D_KL(P || Q) = SUM(P(x) × log(P(x)/Q(x)))

  Also written as:
    D_KL(P || Q) = H(P, Q) - H(P)
                 = cross-entropy - entropy

  "Extra bits needed to encode P using Q's distribution"
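A NumPy sketch of KL divergence, also checking the identity D_KL(P || Q) = H(P, Q) - H(P). The two distributions here are made-up examples, not values from the text:

import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0); terms with p(x) = 0 contribute 0
    return np.sum(p * np.log(np.maximum(p, eps) / np.maximum(q, eps)))

p = np.array([0.7, 0.2, 0.1])   # hypothetical "true" distribution
q = np.array([0.5, 0.3, 0.2])   # hypothetical approximation

print(kl_divergence(p, q))                                # ≈ 0.085
print(-np.sum(p * np.log(q)) + np.sum(p * np.log(p)))     # same value: H(P, Q) - H(P)
print(kl_divergence(p, p))                                # 0.0 for identical distributions
print(kl_divergence(p, q) == kl_divergence(q, p))         # False: not symmetric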
Where you'll see it:

  Fine-tuning with a KL penalty: how far the fine-tuned model has drifted from the base model
  Knowledge distillation: how closely the student's output distribution matches the teacher's
  VAEs, diffusion models: how far a learned distribution is from the prior
Practical note: a KL divergence of 0 means the two distributions are identical; larger values mean they are further apart.
Temperature Scaling
TEMPERATURE EFFECT ON SOFTMAX

  softmax(x / T), where T = temperature

  T = 1.0  → standard softmax
  T > 1.0  → softer distribution (more random sampling)
  T < 1.0  → sharper distribution (more deterministic)
  T → 0    → argmax (always pick the highest logit)

  T = 0.5: [0.88, 0.10, 0.02]   sharp
  T = 1.0: [0.66, 0.24, 0.10]   normal
  T = 2.0: [0.49, 0.31, 0.20]   soft

  Higher temperature = higher entropy = more "creative"
  Lower temperature  = lower entropy  = more "focused"
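A sketch of temperature scaling in NumPy. The logits here are hypothetical, chosen so that T = 1.0 roughly reproduces the middle row of the box above:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide the logits by T, then apply the usual softmax
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= np.max(scaled)              # overflow guard
    exps = np.exp(scaled)
    return exps / np.sum(exps)

logits = [1.9, 0.9, 0.0]                  # hypothetical raw scores

for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 2))
# T = 0.5 → sharp (low entropy), T = 2.0 → soft (high entropy)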
Numerical Stability
COMMON GOTCHA

  Problem: log(0) = negative infinity

  Solution: add a small epsilon
    log(P + 1e-10) or log(max(P, 1e-10))

  In practice: use your framework's built-in cross_entropy.
  It handles numerical stability for you.
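A short sketch of both fixes. The first clamps probabilities before the log; the second delegates to the framework, which works directly from raw logits so log(0) never arises. The PyTorch lines are shown as comments and assume torch is installed:

import numpy as np

eps = 1e-10
p_correct = 0.0                              # a hard zero would make log() return -inf

# Fix 1: add an epsilon (or clamp) before taking the log
print(-np.log(p_correct + eps))              # ≈ 23.0, large but finite
print(-np.log(np.maximum(p_correct, eps)))   # equivalent clamping version

# Fix 2 (preferred): let the framework compute the loss from raw logits, e.g. in PyTorch:
#   import torch
#   import torch.nn.functional as F
#   logits = torch.tensor([[2.1, 0.5, -0.3]])   # raw scores, not probabilities
#   target = torch.tensor([0])                  # index of the true class
#   loss = F.cross_entropy(logits, target)      # stable log-softmax under the hood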
When This Matters

  Understanding model confidence: softmax outputs are probabilities
  Tuning temperature for generation: higher temperature = higher entropy = more random
  Understanding perplexity scores: perplexity = exp(cross-entropy)
  Debugging "model too confident": look at the entropy of the outputs
  Fine-tuning with a KL penalty: the penalty constrains drift from the base model
  Understanding why cross-entropy works: it heavily penalizes confident mistakes
Interview Notes

  💼 Interview relevance: ~70% of ML interviews
  🏭 Production impact: every LLM application
  ⚡ Performance: understanding sampling and temperature