TL;DR
A machine learning model is a mathematical function that maps inputs to outputs, with learnable parameters that are adjusted during training. Understanding parameters, logits, training vs inference, and the bias-variance tradeoff is essential vocabulary for any AI engineering work.
Visual Overview
Input (X) → Model → Output (Y)

- "Is this spam?" → f(x) → 0.92 (yes)
- [image pixels] → f(x) → "cat"
- "Translate this" → f(x) → "Bonjour"

The function has PARAMETERS: numbers that determine how inputs map to outputs. Training adjusts them.

UNTRAINED VS TRAINED

- Untrained model: random parameters → garbage output
- Trained model: learned parameters → useful output
Parameters, Weights, and Biases
Parameters are the learnable values inside a model.
SIMPLE LINEAR MODEL

y = w × x + b

- w = weight (how much x matters)
- b = bias (baseline offset)

These are parameters. Training finds good values.

MODERN LLM SCALE

- GPT-2: 124 million parameters (smallest variant; the largest has 1.5 billion)
- GPT-3: 175 billion parameters
- LLaMA 70B: 70 billion parameters
- Claude: not publicly disclosed

Each parameter is a floating-point number. More parameters mean more capacity to learn patterns, and also more memory and compute to run the model.
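To make the scale concrete, here is a minimal sketch that evaluates the linear model and estimates the memory needed just to store a model's parameters. The helper names `linear_model` and `weight_memory_gb`, and the default of 2 bytes per parameter (16-bit floats), are assumptions for illustration. Real deployments also need memory for activations and caches.

```python
def linear_model(x: float, w: float, b: float) -> float:
    """y = w * x + b, a model with exactly two parameters."""
    return w * x + b


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough size of the parameters alone (assumes 16-bit floats by default)."""
    return num_params * bytes_per_param / 1e9


# Same input, different parameters, different output.
print(linear_model(3.0, w=2.0, b=5.0))   # 11.0
print(linear_model(3.0, w=0.5, b=-1.0))  # 0.5

# Parameter counts as quoted in the text.
for name, n in [("GPT-2 (124M)", 124e6), ("LLaMA 70B", 70e9), ("GPT-3", 175e9)]:
    print(f"{name}: ~{weight_memory_gb(n):.1f} GB at 16-bit precision")
```

At 16-bit precision, 175 billion parameters come to roughly 350 GB, which is why models at that scale are split across multiple accelerators.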
Logits
Logits are raw, unnormalized scores output by a model before converting to probabilities.
Model outputs logits:

- "cat": 4.2
- "dog": 2.1
- "car": -1.3

These are unbounded scores with no fixed scale. Higher = more likely.

Apply softmax to convert to probabilities:

- "cat": 0.888
- "dog": 0.109
- "car": 0.004

Now they sum to 1 (up to rounding) and can be read as the model's confidence in each option.
Why logits matter:
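- LLMs output one logit per token in the vocabulary (often 50,000+ scores) at every generation step
- Temperature rescales the logits before softmax, and sampling draws from the resulting distribution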
- Understanding logits helps debug generation issues
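The softmax step and the effect of temperature can be sketched in a few lines. This is an illustrative implementation using only the standard library, not the code any particular framework uses. The function name `softmax` and its `temperature` argument are choices made for this example.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities. Temperature must be greater than 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.2, 2.1, -1.3]  # the "cat", "dog", "car" scores from the example

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top-scoring option, and higher temperatures flatten the distribution. The ranking of the options never changes, only how strongly the top one dominates.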
Deterministic vs Probabilistic vs Statistical
DETERMINISTIC: Same input → same output, always

    def deterministic(x):
        return x * 2 + 5

    deterministic(3)  # Always 11

PROBABILISTIC: Output includes randomness

    import random

    def probabilistic(x):
        return x * 2 + 5 + random.gauss(0, 1)

    probabilistic(3)  # e.g. 11.23 one time, 10.87 the next

STATISTICAL: Learns patterns from data

    # Assumes scikit-learn; X_train, y_train, X_new are existing arrays
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)  # Learns from data
    model.predict(X_new)         # Generalizes to unseen inputs
ML models are statistical systems that often use probabilistic methods:
- They learn from data (statistical)
- They may sample from distributions (probabilistic)
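- Given the same input and the same random seed, their output is reproducible
LLMs with temperature > 0 are probabilistic. With temperature = 0 (always pick the highest-scoring token), they are deterministic in principle, although hosted services can still show small run-to-run differences caused by floating-point and batching effects.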
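The following sketch shows both behaviours on a toy three-token vocabulary: greedy selection at temperature 0, and seeded sampling at temperature 1. The function `sample_token` and the logit values are hypothetical and only illustrate the idea. Production LLM samplers add further steps such as top-k or top-p filtering.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Return the argmax when temperature is 0, otherwise sample from the softmax."""
    tokens = list(logits)
    if temperature == 0:
        return max(tokens, key=lambda t: logits[t])  # greedy decoding, no randomness
    peak = max(logits.values())
    # Unnormalized softmax weights; random.choices normalizes them internally.
    weights = [math.exp((logits[t] - peak) / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


logits = {"cat": 4.2, "dog": 2.1, "car": -1.3}

# Temperature 0: the same token every time, regardless of the random state.
greedy = [sample_token(logits, 0, random.Random()) for _ in range(5)]
print(greedy)

# Temperature 1 with a fixed seed: outputs vary within a run but repeat across runs.
rng_a, rng_b = random.Random(42), random.Random(42)
run_a = [sample_token(logits, 1.0, rng_a) for _ in range(10)]
run_b = [sample_token(logits, 1.0, rng_b) for _ in range(10)]
print(run_a == run_b)  # True: same seed, same sequence
```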
Training vs Inference
TRAINING

Training loop:
1. Forward pass: compute prediction
2. Compute loss: how wrong is it?
3. Backward pass: compute gradients
4. Update parameters: reduce error
5. Repeat millions of times

Training is:
- Expensive (weeks on GPU clusters for large models)
- Done once (or periodically)
- Dependent on training data: labeled examples for supervised learning, while LLM pretraining derives its targets from the text itself (next-token prediction)

INFERENCE

Inference:
1. Load trained parameters
2. Forward pass only
3. Return prediction

Inference is:
- Cheap relative to training (milliseconds per forward pass for small models; LLMs run one forward pass per generated token)
- Done continuously in production
- No parameter updates
You will mostly do inference. Training an LLM from scratch requires massive compute. Fine-tuning is far more accessible, but it can still be costly.
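The training loop and the inference step can be sketched end to end on a two-parameter linear model. This is a toy illustration under stated assumptions: the data is synthetic (generated from y = 2x + 5 plus noise), the gradients are written out by hand instead of computed by a framework's backward pass, and the learning rate and step count are arbitrary choices that happen to converge here.

```python
import random

# Synthetic data from y = 2x + 5 with a little Gaussian noise (illustrative values).
rng = random.Random(0)
xs = [i / 10 for i in range(-20, 21)]
ys = [2 * x + 5 + rng.gauss(0, 0.1) for x in xs]

w, b = 0.0, 0.0       # untrained parameters
learning_rate = 0.05
n = len(xs)

# --- Training: forward pass, loss, gradients, update, repeat ---
for step in range(500):
    preds = [w * x + b for x in xs]                         # 1. forward pass
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / n                   # 2. mean squared error
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n  # 3. d(loss)/dw
    grad_b = 2 * sum(errors) / n                            #    d(loss)/db
    w -= learning_rate * grad_w                             # 4. step against the gradient
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, last loss={loss:.4f}")


# --- Inference: forward pass only, parameters stay fixed ---
def predict(x: float) -> float:
    return w * x + b


print(predict(3.0))  # close to 11, since the data came from y = 2x + 5
```

Everything expensive lives inside the loop. Once `w` and `b` are fixed, `predict` is a single multiply and add, which is the asymmetry between training cost and inference cost in miniature.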
Why Neural Networks Work
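Neural networks are function approximators. Given enough parameters, they can represent a very wide range of patterns. Whether training actually finds a good solution depends on the data and the optimization.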
UNIVERSAL APPROXIMATION

Theorem: A neural network with sufficiently many neurons can approximate any continuous function on a bounded domain to arbitrary precision.

Translation: If a pattern exists in your data, a big enough network can represent it. The theorem does not promise that training will find those parameters.

WHY DEPTH MATTERS

Shallow network (1-2 layers):
- Can approximate functions
- May need exponentially many neurons for some functions

Deep network (many layers):
- Learns hierarchical features
- Layer 1: edges
- Layer 2: shapes
- Layer 3: objects
- Layer 4: scenes

Each layer builds on the previous one. Composition is powerful. (The edges-to-scenes progression is a simplified picture of what vision networks learn.)
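The approximation claim can be illustrated without any training. The sketch below sets the weights of a one-hidden-layer ReLU network by hand so that it interpolates a target function at evenly spaced points. It then measures how the worst-case error shrinks as neurons are added. This is not the theorem's proof and says nothing about what gradient descent would find. The helper names `build_relu_net` and `max_error` are invented for this example.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def build_relu_net(f, lo: float, hi: float, n_hidden: int):
    """One-hidden-layer ReLU network that interpolates f at evenly spaced knots.

    Weights are set by hand (not trained) so the network equals the
    piecewise-linear interpolant of f on [lo, hi].
    """
    knots = [lo + (hi - lo) * k / n_hidden for k in range(n_hidden + 1)]
    slopes = [
        (f(knots[k + 1]) - f(knots[k])) / (knots[k + 1] - knots[k])
        for k in range(n_hidden)
    ]
    # The output weight of neuron k is the change in slope at knot k.
    out_weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    offset = f(lo)

    def net(x: float) -> float:
        return offset + sum(c * relu(x - knots[k]) for k, c in enumerate(out_weights))

    return net


def max_error(f, net, lo: float, hi: float, samples: int = 1000) -> float:
    """Largest absolute gap between f and the network on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in grid)


for n in (4, 16, 64):
    net = build_relu_net(math.sin, 0.0, 2 * math.pi, n)
    print(f"{n:3d} hidden neurons: max error {max_error(math.sin, net, 0.0, 2 * math.pi):.4f}")
```

Each hidden neuron contributes one bend to the piecewise-linear curve, so more neurons mean a closer fit.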
The Bias-Variance Tradeoff
Two failure modes when learning from data:
UNDERFITTING (HIGH BIAS)

- Training error: High
- Test error: High
- Problem: Model too simple, misses patterns
- Solution: More capacity (bigger model, more features)

OVERFITTING (HIGH VARIANCE)

- Training error: Low (perfect!)
- Test error: High (fails on new data)
- Problem: Model memorized noise, not signal
- Solution: Regularization, more data, simpler model

THE SWEET SPOT

Plot (not reproduced): error versus model complexity, with a training error curve and a test error curve. The sweet spot is the complexity at which test error is lowest.
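Both failure modes can be reproduced with a small k-nearest-neighbours regressor, chosen here because it needs no external libraries. In this model, complexity runs opposite to k: k = 1 memorizes the training set, while k equal to the dataset size predicts the overall mean everywhere. The sine-plus-noise data, the sample sizes, and the function names are assumptions made for this sketch.

```python
import math
import random


def knn_predict(x: float, train: list[tuple[float, float]], k: int) -> float:
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(data: list[tuple[float, float]], train: list[tuple[float, float]], k: int) -> float:
    """Mean squared error of the k-NN predictor on `data`."""
    return sum((knn_predict(x, train, k) - y) ** 2 for x, y in data) / len(data)


def make_data(n: int, rng: random.Random) -> list[tuple[float, float]]:
    """Hypothetical ground truth: a sine wave plus Gaussian noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        points.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return points


rng = random.Random(0)
train, test = make_data(60, rng), make_data(200, rng)

for k in (1, 7, 60):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  test MSE={mse(test, train, k):.3f}")
```

With k = 1 the training error is exactly zero, because every training point is its own nearest neighbour, yet the test error is not: that is overfitting. With k = 60 both errors are high because the model ignores x entirely: that is underfitting. Intermediate values of k typically give the lowest test error.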
Vocabulary Reference
| Term | Definition |
|---|---|
| Model | Mathematical function mapping inputs to outputs |
| Parameters | Learnable values inside the model (weights + biases) |
| Weights | Parameters in connections between neurons |
| Bias | Offset parameter added to weighted sum |
| Features | Measurable properties of input data |
| Logits | Raw, unnormalized output scores |
| Softmax | Converts logits to probabilities (sum to 1) |
| Training | Adjusting parameters to minimize error |
| Inference | Using trained model to make predictions |
| Loss | Measure of how wrong predictions are |
| Gradient | Vector of partial derivatives of the loss with respect to each parameter; stepping against it reduces loss |
| Epoch | One pass through entire training dataset |
When This Matters
| Situation | What to know |
|---|---|
| Discussing model size | Parameters = capacity, larger = more memory |
| Debugging generation | Temperature rescales logits before sampling; lower values make output more predictable |
| Understanding training | Forward pass -> loss -> backward pass -> update |
| Production deployment | Inference only, no training overhead |
| Model selection | Bias-variance tradeoff guides complexity choice |
Production signal