What Is a Model?

Summary

Foundational vocabulary for machine learning: parameters, weights, logits, training vs inference, and why neural networks work.

TL;DR

A machine learning model is a mathematical function that maps inputs to outputs, with learnable parameters that are adjusted during training. Understanding parameters, logits, training vs inference, and the bias-variance tradeoff is essential vocabulary for any AI engineering work.

Visual Overview

Model As Function

                                                           
   Input (X)           Model          Output (Y)         
          
   "Is this spam?"       f(x)          0.92 (yes)          
   [image pixels]        f(x)          "cat"               
   "Translate this"      f(x)          "Bonjour"           
                                                           
   The function has PARAMETERS—numbers that determine      
   how inputs map to outputs. Training adjusts them.       
                                                           


UNTRAINED VS TRAINED

 
 Untrained model: random parameters -> garbage output 
 Trained model: learned parameters -> useful output 
 


Parameters, Weights, and Biases

Parameters are the learnable values inside a model.

Parameters and Scale
SIMPLE LINEAR MODEL

                                                           
   y = w × x + b                                           
                                                           
   w = weight (how much x matters)                         
   b = bias (baseline offset)                              
                                                           
   These are parameters. Training finds good values.       
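
As a minimal sketch, the linear model above can be written as a function whose parameters are passed in explicitly. The name `linear_model` and the numeric values are illustrative assumptions, not taken from any library.

```python
def linear_model(x: float, w: float, b: float) -> float:
    """Compute y = w * x + b for a single input."""
    return w * x + b

# The same input with two parameter settings gives two different outputs.
untrained = linear_model(3.0, w=0.1, b=-2.0)  # arbitrary starting values
trained = linear_model(3.0, w=2.0, b=5.0)     # values that training might arrive at

print(untrained, trained)
```

The code never changes between the two calls; only the parameter values do, which is what training adjusts.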
                                                           


MODERN LLM SCALE

 
 GPT-2: 124 million (smallest) to 1.5 billion (largest) parameters 
 GPT-3: 175 billion parameters 
 LLaMA 70B: 70 billion parameters 
 Claude: parameter count not publicly disclosed 
 
 Each parameter is a floating-point number. 
 More parameters = more capacity to learn patterns. 
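
Because each parameter is a floating-point number, parameter count translates directly into memory. The sketch below is a rough estimate for the weights alone; it ignores activations, caches, and optimizer state, and the helper name `weight_memory_gb` is hypothetical.

```python
# Bytes used to store one parameter in common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(n_params: int, dtype: str = "float16") -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name, n_params in [("124M", 124_000_000),
                       ("70B", 70_000_000_000),
                       ("175B", 175_000_000_000)]:
    print(name, weight_memory_gb(n_params, "float16"), "GB in float16")
```

A 70-billion-parameter model stored in 16-bit floats needs about 140 GB for its weights, which is why large models are split across several GPUs or quantized to smaller formats.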
 


Logits

Logits are raw, unnormalized scores output by a model before converting to probabilities.

Logits to Probabilities

                                                           
   Model outputs logits:                                   
     "cat":  4.2                                           
     "dog":  2.1                                           
     "car": -1.3                                           
                                                           
   These are unbounded real numbers, not probabilities.    
   Higher = more likely.                                   
                                                           
   Apply softmax to convert to probabilities:              
     "cat":  0.888                                         
     "dog":  0.109                                         
     "car":  0.004                                         
                                                           
   Now they sum to 1 and represent confidence.             
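
For readers who want to check the conversion themselves, here is a small sketch of softmax in plain Python applied to the three logits from the diagram. The function name and the dictionary layout are illustrative choices.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = {"cat": 4.2, "dog": 2.1, "car": -1.3}
probs = dict(zip(logits, softmax(list(logits.values()))))

for label, p in probs.items():
    print(f"{label}: {p:.3f}")
```

Softmax exponentiates each logit and divides by the total, so differences between logits matter and their absolute values do not.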
                                                           

Why logits matter:

  • LLMs output one logit per vocabulary token at every step (often 50,000+ scores)
  • Temperature and sampling operate on logits
  • Understanding logits helps debug generation issues
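
To make the temperature point concrete, the sketch below divides the logits by a temperature before softmax and then samples a token. It is a simplified illustration, not the sampling code of any particular LLM library; `token_probabilities` and `sample_token` are hypothetical names.

```python
import math
import random

def token_probabilities(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits divided by a positive temperature."""
    scaled = {t: z / temperature for t, z in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(z - m) for t, z in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token; temperature 0 is treated as greedy decoding."""
    if temperature == 0:
        return max(logits, key=lambda t: logits[t])
    probs = token_probabilities(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

logits = {"cat": 4.2, "dog": 2.1, "car": -1.3}
rng = random.Random(0)

for temperature in (0.5, 1.0, 2.0):
    print(temperature, round(token_probabilities(logits, temperature)["cat"], 3))
print(sample_token(logits, temperature=0, rng=rng))
```

Low temperatures concentrate probability on the top token and high temperatures flatten the distribution, which is a useful first thing to check when generations are too repetitive or too erratic.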

Deterministic vs Probabilistic vs Statistical

Three Types of Systems

                                                           
   DETERMINISTIC: Same input -> same output, always        
                                                           
     def deterministic(x):                                 
         return x * 2 + 5                                  
                                                           
     deterministic(3)  # Always 11                         
                                                           

                                                           
   PROBABILISTIC: Output includes randomness               
                                                           
     import random                                         
                                                           
     def probabilistic(x):                                 
         return x * 2 + 5 + random.gauss(0, 1)             
                                                           
     probabilistic(3)  # 11.23 one time, 10.87 next        
                                                           

                                                           
   STATISTICAL: Learns patterns from data                  
                                                           
     from sklearn.linear_model import LinearRegression     
                                                           
     model = LinearRegression()                            
     model.fit(X_train, y_train)  # Learns from data       
     model.predict(X_new)         # Generalizes            
                                                           

ML models are statistical systems that often use probabilistic methods:

  • They learn from data (statistical)
  • They may sample from distributions (probabilistic)
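  • Given the same input and the same random seed, they are deterministic in principle

LLMs sampled with temperature > 0 are probabilistic. With temperature = 0 (greedy decoding) they are deterministic in principle, though floating-point and batching effects on parallel hardware can still cause small run-to-run differences.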
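
The role of the seed can be seen in a few lines of Python. This sketch samples from a fixed distribution rather than a real model; the token list and weights are made-up values for illustration.

```python
import random

TOKENS = ["cat", "dog", "car"]
WEIGHTS = [0.6, 0.3, 0.1]  # assumed probabilities, not produced by a model

def sample_sequence(seed: int, n: int = 5) -> list[str]:
    """Draw n tokens using a private generator initialised from the seed."""
    rng = random.Random(seed)
    return rng.choices(TOKENS, weights=WEIGHTS, k=n)

first = sample_sequence(seed=42)
second = sample_sequence(seed=42)
print(first)
print(first == second)  # same seed, same draws
```

Fixing the seed makes a sampling step repeatable, which helps when reproducing a bug or comparing two prompts under identical randomness.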


Training vs Inference

TRAINING

                                                           
   Training loop:                                          
     1. Forward pass: compute prediction                   
     2. Compute loss: how wrong is it?                     
     3. Backward pass: compute gradients                   
     4. Update parameters: reduce error                    
     5. Repeat millions of times                           
                                                           
   Training is:                                            
     • Expensive (weeks on GPU clusters for large models)  
     • Done once (or periodically)                         
     • Requires training data (labeled, or raw text for    
       self-supervised LLM pretraining)                    
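
The training loop can be sketched end to end on a toy problem. The example below fits y = w * x + b with plain gradient descent and mean squared error; the dataset, learning rate, and step count are assumptions chosen so that it converges quickly, and real training uses an autodiff framework rather than hand-written gradients.

```python
# Toy dataset generated from y = 2x + 5, so the ideal parameters are known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 5.0 for x in xs]

w, b = 0.0, 0.0        # untrained parameters
learning_rate = 0.05

for step in range(2000):
    # 1. Forward pass: compute predictions with the current parameters.
    preds = [w * x + b for x in xs]
    # 2. Loss: mean squared error between predictions and targets.
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / len(xs)
    # 3. Backward pass: gradients of the loss with respect to w and b.
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    # 4. Update: step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.3f}, b={b:.3f}, loss={loss:.6f}")
```

The parameters start at zero and end close to w = 2 and b = 5, the values used to generate the data.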
                                                           


INFERENCE

 
 Inference: 
 1. Load trained parameters 
 2. Forward pass only 
 3. Return prediction 
 
 Inference is: 
 • Cheap relative to training (often milliseconds per prediction) 
 • Done continuously in production 
 • No parameter updates 
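
In code, inference is just the forward pass with frozen parameters. The sketch below assumes the parameters were saved as JSON at the end of training; the storage format and function names are illustrative, since real systems use framework-specific checkpoint files.

```python
import json

# Hypothetical parameters, as they might be saved at the end of training.
saved = json.dumps({"w": 2.0, "b": 5.0})

def load_parameters(serialized: str) -> dict[str, float]:
    """Step 1: load trained parameters (here from a JSON string)."""
    return json.loads(serialized)

def predict(params: dict[str, float], x: float) -> float:
    """Step 2: forward pass only. No loss, no gradients, no updates."""
    return params["w"] * x + params["b"]

params = load_parameters(saved)
print(predict(params, 3.0))  # Step 3: return the prediction
```

Nothing in `predict` modifies `params`, which is why inference can be replicated across many servers from a single saved copy of the weights.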
 

As an AI engineer, you will mostly run inference. Training an LLM from scratch requires massive compute; fine-tuning is more accessible but still costly.


Why Neural Networks Work

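Neural networks are function approximators. Given enough parameters, they can represent essentially any continuous pattern in data; whether training actually finds that representation depends on the data and the optimizer.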

UNIVERSAL APPROXIMATION

                                                           
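   Theorem: A neural network with one hidden layer, a      
   suitable nonlinear activation, and sufficiently many    
   neurons can approximate any continuous function on a    
   bounded domain to arbitrary precision.                  
                                                           
   Translation: If a continuous pattern exists in your     
   data, a big enough network can represent it. The        
   theorem does not promise that training will find it.    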
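
One way to build intuition for the theorem is to construct a one-hidden-layer ReLU network by hand and watch the error shrink as neurons are added. The sketch below sets the weights directly so that the network interpolates sin(x) at evenly spaced points. It shows representation only: no training is involved, and the construction is one simple choice among many.

```python
import math

def relu(z: float) -> float:
    return max(z, 0.0)

def build_network(target, lo: float, hi: float, n_hidden: int):
    """Choose weights by hand so the network matches `target` at evenly spaced knots."""
    knots = [lo + (hi - lo) * i / n_hidden for i in range(n_hidden + 1)]
    values = [target(k) for k in knots]
    slopes = [(values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
              for i in range(n_hidden)]
    # Output weight of hidden unit i is the change in slope at knot i.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    return knots[:-1], out_weights, values[0]

def forward(x: float, knots, out_weights, out_bias: float) -> float:
    """One hidden layer of relu(x - knot_i) units, then a weighted sum."""
    return out_bias + sum(w * relu(x - k) for k, w in zip(knots, out_weights))

def max_error(n_hidden: int, lo: float = 0.0, hi: float = 2 * math.pi) -> float:
    """Largest gap between the network and sin(x) on a fine grid."""
    net = build_network(math.sin, lo, hi, n_hidden)
    grid = [lo + (hi - lo) * i / 2000 for i in range(2001)]
    return max(abs(forward(x, *net) - math.sin(x)) for x in grid)

for n in (4, 16, 64):
    print(n, "hidden units -> max error", round(max_error(n), 5))
```

Each added hidden unit contributes one more bend to a piecewise-linear curve, so the approximation can be made as tight as desired by adding units.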
                                                           


WHY DEPTH MATTERS

 
 Shallow network (1-2 layers): 
 • Can approximate functions 
 • May need exponentially many neurons for some functions 
 
 Deep network (many layers): 
 • Learns hierarchical features, e.g. in an image model: 
 • Layer 1: edges 
 • Layer 2: shapes 
 • Layer 3: objects 
 • Layer 4: scenes 
 
 Each layer builds on previous. Composition is powerful. 
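
A standard toy illustration of why composition is powerful: stack a two-neuron "tent" layer on top of itself and count the peaks in the output. The function names are hypothetical, and the example is hand-built rather than trained.

```python
def relu(z: float) -> float:
    return max(z, 0.0)

def tent_layer(x: float) -> float:
    """One layer with two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_network(x: float, depth: int) -> float:
    """Apply the same two-unit layer `depth` times."""
    for _ in range(depth):
        x = tent_layer(x)
    return x

def count_peaks(depth: int, resolution: int = 1024) -> int:
    """Count grid points in [0, 1] where the output reaches its maximum of 1."""
    grid = [i / resolution for i in range(resolution + 1)]
    return sum(1 for x in grid if deep_network(x, depth) == 1.0)

for depth in (1, 2, 3, 4, 5):
    print(f"depth={depth}: {2 * depth} ReLU units, {count_peaks(depth)} peaks")
```

The number of peaks doubles with every layer while the neuron count grows by only two. A single hidden layer of ReLU units on a scalar input adds at most one bend per unit, so matching the depth-k output with a shallow network takes on the order of 2^k units.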
 


The Bias-Variance Tradeoff

Two failure modes when learning from data:

Bias-Variance Tradeoff
UNDERFITTING (HIGH BIAS)

                                                           
   Training error: High                                    
   Test error: High                                        
   Problem: Model too simple, misses patterns              
   Solution: More capacity (bigger model, more features)   
                                                           


OVERFITTING (HIGH VARIANCE)

 
 Training error: Low (perfect!) 
 Test error: High (fails on new data) 
 Problem: Model memorized noise, not signal 
 Solution: Regularization, more data, simpler model 
 


THE SWEET SPOT

 
  
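 Plot: error (vertical axis) against model complexity (horizontal axis). 
 Training error keeps falling as complexity grows. 
 Test error falls, bottoms out at the sweet spot, then rises again. 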
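
Both failure modes can be reproduced with a very small experiment. The sketch below uses k-nearest-neighbour regression on noisy samples of sin(x), which is an assumption made for simplicity: small k gives a flexible model that memorizes the training set, and large k gives a rigid model that predicts roughly the average everywhere. The data, noise level, and values of k are illustrative.

```python
import math
import random

def make_data(n: int, rng: random.Random):
    """Noisy samples of sin(x) on [0, 2*pi]."""
    xs = [rng.uniform(0.0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def knn_predict(x: float, train_x, train_y, k: int) -> float:
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(xs, ys, train_x, train_y, k: int) -> float:
    """Mean squared error of the k-NN predictions on (xs, ys)."""
    return sum((knn_predict(x, train_x, train_y, k) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

# Small k = flexible model (high variance); large k = rigid model (high bias).
for k in (1, 10, 200):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(test_x, test_y, train_x, train_y, k)
    print(f"k={k:3d}  train error={train_err:.3f}  test error={test_err:.3f}")
```

With k = 1 the training error is exactly zero because every training point is its own nearest neighbour, yet the test error is not. With k = 200 the model ignores the input entirely and both errors are high. An intermediate k sits between the two extremes.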
 


Vocabulary Reference

Term         Definition
Model        Mathematical function mapping inputs to outputs
Parameters   Learnable values inside the model (weights + biases)
Weights      Parameters that scale the connections between neurons
Bias         Offset parameter added to the weighted sum
Features     Measurable properties of input data
Logits       Raw, unnormalized output scores
Softmax      Converts logits to probabilities (sum to 1)
Training     Adjusting parameters to minimize error
Inference    Using a trained model to make predictions
Loss         Measure of how wrong predictions are
Gradient     Partial derivatives of the loss with respect to the parameters; stepping against it reduces loss
Epoch        One pass through the entire training dataset

When This Matters

Situation                What to know
Discussing model size    More parameters = more capacity and more memory
Debugging generation     Temperature rescales logits before sampling
Understanding training   Forward pass -> loss -> backward pass -> update
Production deployment    Inference only, no training overhead
Model selection          Bias-variance tradeoff guides complexity choice

Production signal

Why this concept matters

Interview     80% of ML interviews
Production    Every ML system
Performance   Foundation for all AI work