Probability Basics

Summary

Foundation for softmax, cross-entropy, temperature scaling, and sampling in AI systems

TL;DR

Probability distributions assign likelihoods to outcomes. For AI engineering, understanding softmax, entropy, cross-entropy, and KL divergence is essential for working with model outputs, loss functions, and temperature scaling.

Visual Overview

Model Output to Probabilities

                                                           
   Raw model output (logits):                              
     [2.1, 0.5, -0.3]     Just numbers, not probs         
                                                           
   After softmax:                                          
     [0.77, 0.16, 0.07]   Probability distribution        
                                                           
   Properties:                                             
     • Each value in [0, 1]                                
     • Sum = 1.0 (certainty is distributed)                
                                                           


The Softmax Function

Converts raw scores into a probability distribution:

Softmax Formula

                                                           
   softmax(x_i) = exp(x_i) / SUM(exp(x_j))                 
                                                           
   Why exp()?                                              
     • Makes all values positive                           
     • Preserves relative ordering                         
     • Amplifies differences                               
                                                           

When a model outputs [0.77, 0.16, 0.07], it is saying: “I’m 77% confident this is a cat, 16% dog, 7% bird.”
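
As a rough illustration, the softmax formula can be sketched in plain Python using only the standard library. The function name `softmax` and the max-subtraction step are choices made for this sketch, not something the text prescribes; subtracting the maximum leaves the result unchanged and only keeps exp() from overflowing on large scores.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Shifting by the max does not change the result, but avoids overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, 0.5, -0.3])
print([round(p, 2) for p in probs])  # each value lies in [0, 1]
print(sum(probs))                    # the values sum to 1 (up to rounding)
```

The largest logit always receives the largest probability, which is the "preserves relative ordering" property.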


Expected Value

The expected value is the weighted average of outcomes.

                                                           
   E[X] = SUM(outcome × probability)                       
                                                           
   Example: Rolling a fair die                             
     E[X] = (1×1/6) + (2×1/6) + ... + (6×1/6) = 3.5        
                                                           
   For model outputs:                                      
     If rewards = [10, 5, 1] and P = [0.77, 0.16, 0.07]    
     E[reward] = (10×0.77) + (5×0.16) + (1×0.07) = 8.57    
                                                           

Why it matters: A training loss is an expected value. Training minimizes the average per-example loss across the dataset.
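
The weighted-average definition can be sketched in a few lines of Python. The helper name `expected_value` and the list of per-example losses are hypothetical, chosen only to show that a mean loss is an expected value with equal weights.

```python
def expected_value(outcomes, probs):
    """Weighted average: sum of outcome × probability."""
    return sum(o * p for o, p in zip(outcomes, probs))

# Fair six-sided die: every face has probability 1/6.
die_mean = expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6)

# Hypothetical per-example losses. With equal weights 1/N, the expected
# loss is simply the mean loss over the dataset.
losses = [0.36, 4.6, 0.1, 0.9]
mean_loss = expected_value(losses, [1 / len(losses)] * len(losses))

print(die_mean, mean_loss)
```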


Entropy: Average Surprise

Entropy measures uncertainty in a distribution. High entropy = uncertain. Low entropy = confident.

Entropy
ENTROPY INTUITION

                                                           
   "Surprise" of an event = -log(P)                        
                                                           
   Low probability  → high surprise:  P=0.01 → 4.6         
   High probability → low surprise:   P=0.99 → 0.01        
                                                           
   Entropy = Expected surprise = average across outcomes   
                                                           
   H(P) = -SUM(P(x) × log(P(x)))                           
                                                           


ENTROPY EXAMPLES

 
 Uniform distribution (maximum uncertainty): 
 P = [0.25, 0.25, 0.25, 0.25] 
 H = 1.39 nats (2 bits) 
 "Model has no idea, all options equally likely" 
 
 Peaked distribution (confident): 
 P = [0.97, 0.01, 0.01, 0.01] 
 H = 0.17 nats (0.24 bits) 
 "Model is pretty sure it's the first option" 
 
 One-hot distribution (certain): 
 P = [1.0, 0.0, 0.0, 0.0] 
 H = 0 
 "Model is certain" 
 

Why it matters:

  • Entropy of model output tells you confidence
  • Temperature scaling manipulates entropy (higher temp = more uniform)
  • Perplexity = exp(entropy), with entropy in nats: “how many choices is the model confused between?”
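
A small sketch of the entropy formula, assuming plain Python lists of probabilities. The `base` parameter is an addition for this example: the natural log gives nats and base 2 gives bits, which matters when comparing reported values. Zero-probability outcomes are skipped because they contribute nothing to the sum.

```python
import math

def entropy(probs, base=math.e):
    """H(P) = -SUM(P(x) × log(P(x))); nats by default, bits with base=2."""
    # Outcomes with p == 0 are skipped: p × log(p) tends to 0 as p -> 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
one_hot = [1.0, 0.0, 0.0, 0.0]

for name, dist in [("uniform", uniform), ("peaked", peaked), ("one-hot", one_hot)]:
    print(name, round(entropy(dist), 2), "nats,", round(entropy(dist, 2), 2), "bits")

# Perplexity = exp(entropy in nats): the "effective number of choices".
perplexity_uniform = math.exp(entropy(uniform))
print(perplexity_uniform)
```

For the uniform distribution over four options the perplexity comes out at 4, matching the intuition that the model is undecided between four choices.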

Cross-Entropy: Comparing Distributions

Cross-entropy measures how well a predicted distribution Q matches the true distribution P. It is the standard loss for classification.

Cross-Entropy

                                                           
   H(P, Q) = -SUM(P(x) × log(Q(x)))                        
                                                           
   Where:                                                  
     P = true distribution (ground truth)                  
     Q = predicted distribution (model output)             
                                                           


CROSS-ENTROPY AS LOSS

 
 True label: "cat" → P = [1, 0, 0] (one-hot) 
 Model prediction: Q = [0.7, 0.2, 0.1] 
 
 H(P, Q) = -[1×log(0.7) + 0×log(0.2) + 0×log(0.1)] 
 = -log(0.7) 
 = 0.36 
 
 Only the true class matters! Simplifies to: 
 Loss = -log(P_correct) 
 
 Punishes confident wrong predictions severely: 
 If Q = [0.01, 0.98, 0.01] for true class cat: 
 Loss = -log(0.01) = 4.6 (much higher!) 
 

Key insight: Cross-entropy penalizes low confidence in the correct answer. A model that puts 1% probability on the right answer pays a huge penalty.
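
The two loss values in the example can be reproduced with a short sketch. The function name `cross_entropy` and the `eps` floor are assumptions of this example (the floor only guards against log(0)); real frameworks usually compute the loss from logits instead.

```python
import math

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(P, Q) = -SUM(P(x) × log(Q(x))), using the natural log."""
    # eps is an illustrative floor so that log(0) is never evaluated.
    return sum(-p * math.log(max(q, eps))
               for p, q in zip(p_true, q_pred) if p > 0)

cat = [1, 0, 0]  # one-hot true label "cat"

loss_good = cross_entropy(cat, [0.7, 0.2, 0.1])    # fairly confident and right
loss_bad = cross_entropy(cat, [0.01, 0.98, 0.01])  # confident and wrong

print(round(loss_good, 2), round(loss_bad, 2))
```

With a one-hot target only the true-class term survives, so both results equal -log of the probability assigned to "cat".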


KL Divergence: Distance Between Distributions

KL divergence measures how different two distributions are. It is not symmetric: DKL(P || Q) generally differs from DKL(Q || P), so it is not a true distance metric.

KL Divergence

                                                           
   DKL(P || Q) = SUM(P(x) × log(P(x)/Q(x)))                
                                                           
   Also written as:                                        
     DKL(P || Q) = H(P, Q) - H(P)                          
                 = Cross-entropy - Entropy                 
                                                           
   "Extra bits needed to encode samples from P using a code optimized for Q"  
                                                           

Where you’ll see it:

Context                        What it measures
Fine-tuning with KL penalty    How far fine-tuned model drifted from base
Knowledge distillation         How well student matches teacher
VAEs, diffusion models         Difference from prior distribution

Practical note: A KL divergence of 0 means the two distributions are identical. Larger values mean they differ more.
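
The definition, the identity DKL(P || Q) = H(P, Q) - H(P), and the asymmetry can all be checked numerically. This is a minimal sketch: the distributions `p` and `q` are made-up values, and the helper assumes Q(x) > 0 wherever P(x) > 0 (otherwise the divergence is infinite).

```python
import math

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """DKL(P || Q) = SUM(P(x) × log(P(x)/Q(x))); assumes q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, for illustration only.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print(kl_pq, cross_entropy(p, q) - entropy(p))  # same value, two routes
print(kl_pq, kl_qp)                             # differ: KL is not symmetric
print(kl_pp)                                    # identical distributions give 0
```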


Temperature Scaling

Temperature Effect on Softmax

                                                           
   softmax(x / T) where T = temperature                    
                                                           
   T = 1.0 → standard softmax                              
   T > 1.0 → softer distribution (more random sampling)    
   T < 1.0 → sharper distribution (more deterministic)     
   T → 0   → argmax (always pick highest)                  
                                                           
                    
    Example logits [2.0, 1.0, 0.1]:                      
    T=0.5:  [0.86, 0.12, 0.02]  sharp                    
    T=1.0:  [0.66, 0.24, 0.10]  normal                   
    T=2.0:  [0.50, 0.30, 0.19]  soft                     
                    
                                                           
   Higher temperature = higher entropy = more "creative"   
   Lower temperature = lower entropy = more "focused"      
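
A short sketch of temperature scaling, assuming example logits of [2.0, 1.0, 0.1] (an illustrative choice) and a hypothetical helper named `softmax_with_temperature`. It divides the logits by T before the softmax and reports the entropy at each setting, so the link between temperature and entropy can be observed directly.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(x / T); T must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (use argmax for the T -> 0 limit)")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

logits = [2.0, 1.0, 0.1]  # assumed example scores
entropies = {}
for t in (0.5, 1.0, 2.0):
    dist = softmax_with_temperature(logits, t)
    entropies[t] = entropy(dist)
    print(t, [round(p, 2) for p in dist], round(entropies[t], 2))
```

The printed entropy rises as T rises, which is the "higher temperature = higher entropy" relationship in numeric form.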
                                                           


Numerical Stability

COMMON GOTCHA

                                                           
   Problem: log(0) = negative infinity                     
                                                           
   Solution: Add small epsilon                             
     log(P + 1e-10)  or  log(max(P, 1e-10))                
                                                           
   In practice: Use framework's built-in cross_entropy     
                It handles numerical stability for you     
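
One common way to avoid log(0) altogether is to compute log-probabilities directly from the logits with the log-sum-exp trick, rather than taking the log of softmax outputs. The sketch below illustrates that idea in plain Python; the function names are hypothetical and it is not a description of any particular framework's internals.

```python
import math

def log_softmax(logits):
    """log(softmax(x)) computed without ever forming a probability of 0."""
    m = max(logits)
    # log(SUM(exp(x_j))) = m + log(SUM(exp(x_j - m))); every exponent is <= 0.
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def cross_entropy_from_logits(logits, target_index):
    """Loss = -log(P_correct), taken straight from the logits."""
    return -log_softmax(logits)[target_index]

# Extreme scores: a naive math.exp(1000.0) would raise OverflowError, and the
# softmax probability of the last class would underflow to 0.
extreme = [1000.0, 0.0, -1000.0]
loss = cross_entropy_from_logits(extreme, target_index=1)
print(loss, math.isfinite(loss))
```

The loss stays finite even for a class whose probability would round to zero, which an epsilon added after the fact can only approximate.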
                                                           


When This Matters

Situation                                Concept to apply
Understanding model confidence           Softmax outputs as probabilities
Tuning temperature for generation        Higher temp = higher entropy = more random
Understanding perplexity scores          Perplexity = exp(cross-entropy)
Debugging “model too confident”          Look at entropy of outputs
Fine-tuning with KL penalty              Constrains drift from base model
Understanding why cross-entropy works    It heavily penalizes confident mistakes

Production signal

Why this concept matters

  • Interview: 70% of ML interviews
  • Production: Every LLM application
  • Performance: Understanding sampling and temperature