A machine learning model is a mathematical function that maps inputs to outputs, with learnable parameters that are adjusted during training. Understanding parameters, logits, training vs inference, and the bias-variance tradeoff is essential vocabulary for any AI engineering work.
Visual Overview
Model As Function
MODEL AS FUNCTION

  Input (X)  →  Model  →  Output (Y)
  ──────────────────────────────────
  "Is this spam?"     f(x)   0.92 (yes)
  [image pixels]      f(x)   "cat"
  "Translate this"    f(x)   "Bonjour"

  The function has PARAMETERS: numbers that determine how inputs map to
  outputs. Training adjusts them.

UNTRAINED VS TRAINED

  Untrained model: random parameters  → garbage output
  Trained model:   learned parameters → useful output
Parameters, Weights, and Biases
Parameters are the learnable values inside a model.
Parameters and Scale
SIMPLE LINEAR MODEL

  y = w × x + b

  w = weight (how much x matters)
  b = bias (baseline offset)

  These are parameters. Training finds good values for them.

MODERN LLM SCALE

  GPT-2:      124 million parameters
  GPT-3:      175 billion parameters
  LLaMA 70B:  70 billion parameters
  Claude:     undisclosed, but similar scale

  Each parameter is a floating-point number.
  More parameters = more capacity to learn patterns.
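A minimal sketch of the same idea in code (the parameter values below are made up for illustration; training would find good ones):

    def linear_model(x, w, b):
        """Predict y from x using two learnable parameters."""
        return w * x + b

    # Untrained: arbitrary parameters give arbitrary output.
    print(linear_model(3.0, w=0.1, b=-2.0))   # -1.7, essentially garbage

    # "Trained": parameters that happen to fit a real pattern (y = 2x + 5).
    print(linear_model(3.0, w=2.0, b=5.0))    # 11.0, a useful prediction

An LLM works the same way in principle; it just has billions of such numbers instead of two.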
Logits
Logits are raw, unnormalized scores output by a model before converting to probabilities.
Logits to Probabilities
LOGITS TO PROBABILITIES

  Model outputs logits:
    "cat":  4.2
    "dog":  2.1
    "car": -1.3

  These are arbitrary numbers. Higher = more likely.

  Apply softmax to convert them to probabilities:
    "cat": 0.89
    "dog": 0.10
    "car": 0.01

  Now they sum to 1 and represent confidence.
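A small sketch of softmax on the logits above (plain Python, no libraries; a real implementation would subtract the largest logit before exponentiating for numerical stability):

    import math

    def softmax(logits):
        """Convert raw logits into probabilities that sum to 1."""
        exps = [math.exp(x) for x in logits]   # exponentiate each score
        total = sum(exps)
        return [e / total for e in exps]       # normalize

    logits = {"cat": 4.2, "dog": 2.1, "car": -1.3}
    probs = softmax(list(logits.values()))
    for label, p in zip(logits, probs):
        print(label, round(p, 2))   # cat ~0.89, dog ~0.11, car ~0.0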
Why logits matter:
LLMs output one logit per vocabulary token (50,000+ scores at every generation step), and sampling settings like temperature operate directly on those logits
THREE TYPES OF SYSTEMS

  DETERMINISTIC: same input → same output, always

    def deterministic(x):
        return x * 2 + 5

    deterministic(3)  # Always 11

  PROBABILISTIC: output includes randomness

    def probabilistic(x):
        return x * 2 + 5 + random.gauss(0, 1)

    probabilistic(3)  # 11.23 one time, 10.87 the next

  STATISTICAL: learns patterns from data

    model = LinearRegression()
    model.fit(X_train, y_train)   # Learns from data
    model.predict(X_new)          # Generalizes to new inputs
ML models are statistical systems that often use probabilistic methods:
They learn from data (statistical)
They may sample from distributions (probabilistic)
Given same input + same random seed, they’re deterministic
LLMs sampled with temperature > 0 are probabilistic. With temperature = 0 (greedy decoding), they are effectively deterministic.
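A rough sketch of how temperature controls this, using the toy logits from earlier (real decoders add top-k/top-p filtering and other details on top):

    import math
    import random

    def sample_with_temperature(logits, temperature):
        """Pick an index from logits; temperature scales the randomness."""
        if temperature == 0:
            # Greedy decoding: always take the highest logit (deterministic).
            return max(range(len(logits)), key=lambda i: logits[i])
        scaled = [l / temperature for l in logits]          # higher T flattens the distribution
        exps = [math.exp(s - max(scaled)) for s in scaled]  # numerically stable softmax
        probs = [e / sum(exps) for e in exps]
        return random.choices(range(len(logits)), weights=probs, k=1)[0]

    logits = [4.2, 2.1, -1.3]   # "cat", "dog", "car"
    print(sample_with_temperature(logits, temperature=0))    # always 0 ("cat")
    print(sample_with_temperature(logits, temperature=1.0))  # usually 0, sometimes 1 or 2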
Training vs Inference
TRAINING

  Training loop:
    1. Forward pass: compute prediction
    2. Compute loss: how wrong is it?
    3. Backward pass: compute gradients
    4. Update parameters: reduce error
    5. Repeat millions of times

  Training is:
    • Expensive (weeks on GPU clusters)
    • Done once (or periodically)
    • Requires labeled data

INFERENCE

  Inference:
    1. Load trained parameters
    2. Forward pass only
    3. Return prediction

  Inference is:
    • Cheap (milliseconds per prediction)
    • Done continuously in production
    • No parameter updates
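The whole loop fits in a few lines of plain Python for the linear model from earlier (a toy sketch on a fabricated dataset; real training uses autograd frameworks, batching, and GPUs):

    # Fit y = w*x + b to points that lie on the line y = 2x + 5.
    data = [(1.0, 7.0), (2.0, 9.0), (3.0, 11.0), (4.0, 13.0)]

    w, b = 0.0, 0.0   # untrained: arbitrary starting parameters
    lr = 0.01         # learning rate

    for step in range(5000):                       # repeat many times
        grad_w, grad_b = 0.0, 0.0
        for x, y in data:
            pred = w * x + b                       # 1. forward pass
            error = pred - y                       # 2. loss: squared error, how wrong is it?
            grad_w += 2 * error * x / len(data)    # 3. backward pass: gradients
            grad_b += 2 * error / len(data)
        w -= lr * grad_w                           # 4. update parameters
        b -= lr * grad_b

    # Inference: use the learned parameters, forward pass only.
    print(round(w, 2), round(b, 2))   # close to 2.0 and 5.0
    print(w * 10.0 + b)               # prediction for a new input, close to 25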
You will mostly do inference. Training LLMs requires massive compute. Fine-tuning is more accessible but still expensive.
Why Neural Networks Work
Neural networks are function approximators. Given enough parameters, they can learn any pattern in data.
UNIVERSAL APPROXIMATION

  Theorem: a neural network with sufficient neurons can approximate any
  continuous function to arbitrary precision.

  Translation: if a pattern exists in your data, a big enough network can
  learn it.

WHY DEPTH MATTERS

  Shallow network (1-2 layers):
    • Can approximate functions
    • Needs exponentially many neurons

  Deep network (many layers):
    • Learns hierarchical features
    • Layer 1: edges
    • Layer 2: shapes
    • Layer 3: objects
    • Layer 4: scenes

  Each layer builds on the previous one. Composition is powerful.
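A toy illustration of that composition (the weights below are invented, not learned, and real layers have thousands of units):

    def relu(vec):
        """Zero out negative values; the usual nonlinearity between layers."""
        return [max(0.0, v) for v in vec]

    def dense(inputs, weights, biases):
        """One fully connected layer: weighted sums of the inputs plus biases."""
        return [sum(w * x for w, x in zip(row, inputs)) + b
                for row, b in zip(weights, biases)]

    # Illustrative parameters only; training would learn these values.
    W1, b1 = [[0.5, -0.2], [0.8, 0.3]], [0.1, -0.1]   # layer 1: low-level features
    W2, b2 = [[1.0, -1.0]], [0.0]                     # layer 2: combinations of layer-1 features

    x = [2.0, 3.0]
    h = relu(dense(x, W1, b1))   # layer 1 output
    y = dense(h, W2, b2)         # layer 2 builds on layer 1's output
    print(h, y)

Each extra layer is just another function applied to the previous layer's output, which is what lets deep networks build features on top of features.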
The Bias-Variance Tradeoff
Two failure modes when learning from data:
UNDERFITTING (HIGH BIAS)

  Training error: high
  Test error:     high
  Problem:  model too simple, misses patterns
  Solution: more capacity (bigger model, more features)

OVERFITTING (HIGH VARIANCE)

  Training error: low (perfect!)
  Test error:     high (fails on new data)
  Problem:  model memorized noise, not signal
  Solution: regularization, more data, simpler model

THE SWEET SPOT

  [Plot: error vs model complexity. Training error falls steadily as
  complexity grows; test error falls at first, then rises again. The sweet
  spot is where test error is lowest, before it starts climbing.]
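One way to see the tradeoff concretely (a sketch assuming scikit-learn and NumPy are installed; the dataset is synthetic) is to sweep model complexity and compare training error against test error:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.2, size=60)   # noisy underlying pattern

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for degree in (1, 4, 15):   # too simple, roughly right, very flexible
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        train_err = mean_squared_error(y_train, model.predict(X_train))
        test_err = mean_squared_error(y_test, model.predict(X_test))
        print(f"degree={degree:2d}  train={train_err:.3f}  test={test_err:.3f}")

    # Typical pattern: degree 1 underfits (both errors high), degree 15 overfits
    # (training error near zero, test error climbs), and the sweet spot is in between.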
Vocabulary Reference
Term          Definition
----          ----------
Model         Mathematical function mapping inputs to outputs
Parameters    Learnable values inside the model (weights + biases)
Weights       Parameters on the connections between neurons
Bias          Offset parameter added to a weighted sum
Features      Measurable properties of the input data
Logits        Raw, unnormalized output scores
Softmax       Converts logits to probabilities that sum to 1
Training      Adjusting parameters to minimize error
Inference     Using a trained model to make predictions
Loss          Measure of how wrong the predictions are
Gradient      Direction to adjust parameters to reduce the loss
Epoch         One pass through the entire training dataset
When This Matters
Situation                What to know
---------                ------------
Discussing model size    Parameters = capacity; larger = more memory
Debugging generation     Temperature affects how logits are sampled
Understanding training   Forward pass → loss → backward pass → update
Production deployment    Inference only, no training overhead
Model selection          The bias-variance tradeoff guides complexity choice
Interview Notes
💼 Interview relevance: comes up in roughly 80% of ML interviews
🏭 Production impact: this vocabulary underpins every ML system
⚡ Foundation for all AI work