
Probability Basics

Foundation for softmax, cross-entropy, temperature scaling, and sampling in AI systems

TL;DR

Probability distributions assign likelihoods to outcomes. For AI engineering, understanding softmax, entropy, cross-entropy, and KL divergence is essential for working with model outputs, loss functions, and temperature scaling.

Visual Overview

Model Output to Probabilities (diagram)

The Softmax Function

Converts raw scores into a probability distribution:

Softmax Formula

    softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

When a model outputs [0.72, 0.15, 0.13], it’s saying: “I’m 72% confident this is a cat, 15% dog, 13% bird.”
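
A minimal sketch of softmax in NumPy; the logit values below are made up so that they roughly reproduce the cat/dog/bird example above:

```python
import numpy as np

def softmax(logits):
    # Exponentiate each score, then normalize so the outputs sum to 1.
    exps = np.exp(logits)
    return exps / exps.sum()

logits = np.array([1.67, 0.10, -0.04])   # hypothetical raw scores for cat, dog, bird
probs = softmax(logits)
print(np.round(probs, 2))   # -> [0.72 0.15 0.13]
print(probs.sum())          # -> 1.0 (up to floating-point rounding)
```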


Expected Value

The expected value is the weighted average of outcomes.

Expected Value

    E[X] = Σ_i p(x_i) · x_i

Why it matters: Loss functions compute expected loss. Training minimizes expected error across the dataset.
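
As a quick illustration, the expected value is just a probability-weighted sum; the outcomes and probabilities below are made up:

```python
import numpy as np

outcomes = np.array([0.0, 1.0, 2.0])   # hypothetical per-example loss values
probs    = np.array([0.7, 0.2, 0.1])   # probability of each outcome (sums to 1)

# Expected value: probability-weighted average of the outcomes.
expected = (probs * outcomes).sum()    # 0.7*0.0 + 0.2*1.0 + 0.1*2.0
print(expected)                        # -> 0.4
```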


Entropy: Average Surprise

Entropy measures uncertainty in a distribution. High entropy = uncertain. Low entropy = confident.

Entropy

    H(P) = -Σ_i p_i · log(p_i)

Why it matters:

  • Entropy of model output tells you confidence
  • Temperature scaling manipulates entropy (higher temp = more uniform)
  • Perplexity = exp(entropy) — “how many choices is the model confused between?”
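
A small sketch comparing the entropy (and perplexity) of a confident distribution versus a uniform one; both distributions are invented for illustration:

```python
import numpy as np

def entropy(p):
    # H(P) = -sum(p * log p), using natural log, so perplexity = exp(H).
    p = np.asarray(p)
    return -(p * np.log(p)).sum()

confident = [0.97, 0.02, 0.01]
uniform   = [1/3, 1/3, 1/3]

for name, dist in [("confident", confident), ("uniform", uniform)]:
    h = entropy(dist)
    print(f"{name}: entropy={h:.3f}, perplexity={np.exp(h):.3f}")
# confident: entropy≈0.154, perplexity≈1.17  (barely uncertain)
# uniform:   entropy≈1.099, perplexity≈3.00  (confused between ~3 choices)
```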

Cross-Entropy: Comparing Distributions

Cross-entropy measures how well distribution Q predicts distribution P. Used as the standard classification loss.

Cross-Entropy

    H(P, Q) = -Σ_i p_i · log(q_i)

where P is the true distribution and Q is the model’s predicted distribution.

Key insight: Cross-entropy penalizes low confidence in the correct answer. A model that puts 1% probability on the right answer pays a huge penalty.
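
A minimal sketch of cross-entropy with a one-hot true label, where the loss reduces to the negative log probability of the correct class; the two predicted distributions are made up:

```python
import numpy as np

def cross_entropy(true_index, predicted_probs):
    # With a one-hot target, cross-entropy reduces to -log(prob of the true class).
    return -np.log(predicted_probs[true_index])

good_model = np.array([0.90, 0.07, 0.03])   # 90% on the correct class (index 0)
bad_model  = np.array([0.01, 0.49, 0.50])   # only 1% on the correct class

print(cross_entropy(0, good_model))   # -> ~0.105 (small penalty)
print(cross_entropy(0, bad_model))    # -> ~4.605 (huge penalty)
```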


KL Divergence: Distance Between Distributions

KL divergence measures how different two distributions are. It’s not symmetric.

KL Divergence

    D_KL(P || Q) = Σ_i p_i · log(p_i / q_i) = H(P, Q) - H(P)

Where you’ll see it:

  • Fine-tuning with KL penalty: how far the fine-tuned model has drifted from the base model
  • Knowledge distillation: how well the student matches the teacher
  • VAEs, diffusion models: difference from the prior distribution

Practical note: A KL divergence of 0 means the distributions are identical; larger values mean they differ more.
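
A short sketch of KL divergence between two discrete distributions, also showing that swapping the arguments changes the value; both distributions are invented:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum(p * log(p / q)); assumes q has no zeros where p > 0.
    p, q = np.asarray(p), np.asarray(q)
    return (p * np.log(p / q)).sum()

base_model      = [0.5, 0.3, 0.2]   # hypothetical next-token distribution of a base model
finetuned_model = [0.7, 0.2, 0.1]   # hypothetical distribution after fine-tuning

print(kl_divergence(finetuned_model, base_model))  # -> ~0.085
print(kl_divergence(base_model, finetuned_model))  # -> ~0.092 (different: not symmetric)
```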


Temperature Scaling

Temperature Effect on Softmax

Temperature rescales the logits before softmax: softmax(z / T). A temperature above 1 flattens the distribution (higher entropy, more random sampling), a temperature below 1 sharpens it (lower entropy, more deterministic), and T = 1 leaves it unchanged.
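
A minimal sketch of temperature applied to the same hypothetical cat/dog/bird logits used earlier:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by the temperature, then apply the usual softmax.
    scaled = np.asarray(logits) / temperature
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [1.67, 0.10, -0.04]   # same hypothetical cat/dog/bird scores as above
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
# T=0.5 -> sharper   (≈ [0.93, 0.04, 0.03])
# T=1.0 -> unchanged (≈ [0.72, 0.15, 0.13])
# T=2.0 -> flatter   (≈ [0.53, 0.24, 0.23])
```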

Numerical Stability

Naive softmax overflows: exp() of a large logit is infinite in floating point, and the result becomes NaN. The standard fix is to subtract the maximum logit before exponentiating, which leaves the output unchanged because softmax is invariant to adding a constant to every logit.
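
A sketch of the max-subtraction trick; the large logit values are chosen only to trigger overflow in the naive version:

```python
import numpy as np

def naive_softmax(logits):
    exps = np.exp(logits)              # overflows to inf for large logits
    return exps / exps.sum()

def stable_softmax(logits):
    shifted = logits - np.max(logits)  # largest logit becomes 0, so exp() stays finite
    exps = np.exp(shifted)
    return exps / exps.sum()           # same result: softmax ignores a constant shift

big_logits = np.array([1000.0, 999.0, 995.0])
print(naive_softmax(big_logits))    # -> [nan nan nan] plus overflow warnings
print(stable_softmax(big_logits))   # -> ≈ [0.73, 0.27, 0.005]
```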

When This Matters

  • Understanding model confidence: softmax outputs as probabilities
  • Tuning temperature for generation: higher temp = higher entropy = more random
  • Understanding perplexity scores: perplexity = exp(cross-entropy)
  • Debugging “model too confident”: look at the entropy of the outputs
  • Fine-tuning with KL penalty: constrains drift from the base model
  • Understanding why cross-entropy works: it heavily penalizes confident mistakes

Interview Notes
  • Interview relevance: comes up in roughly 70% of ML interviews
  • Production impact: these concepts underpin every LLM application
  • Performance: understanding sampling and temperature