TL;DR
GPU memory (VRAM) limits which models you can run. A 7B model needs ~14 GB for its weights in FP16, or ~3.5 GB in INT4. Quantization trades a small quality loss for large memory savings. Understanding these tradeoffs is essential for deploying LLMs.
Visual Overview
CPU: 8-64 powerful cores
• Good at complex, sequential tasks
• Each core handles different work

GPU: 1,000-10,000+ simple cores
• Good at simple, parallel tasks
• All cores do the same operation on different data

Matrix multiplication (the core ML operation):
• CPU: computes elements one by one (or a few at a time)
• GPU: computes thousands of elements simultaneously

Result: GPUs are 10-100x faster for ML workloads.

Key GPU specs:
| Spec | What it means | Why it matters |
|---|---|---|
| CUDA cores | Number of parallel processors | More = faster |
| VRAM | Video memory | Limits model size |
| Memory bandwidth | Data transfer speed | Limits throughput |
| Tensor cores | Specialized matrix units | 2-4x faster for ML |
VRAM (Video RAM)
GPU memory is separate from system RAM. Models must fit in VRAM.
WHAT USES VRAM
• Training: VRAM = Model + Activations + Gradients + Optimizer state
• Inference: VRAM = Model + Activations (+ KV cache for LLMs)

TRAINING MEMORY (FP16), 7B parameter model

| Component | Calculation | Memory |
|---|---|---|
| Model weights | 7B × 2 bytes | 14 GB |
| Gradients | 7B × 2 bytes | 14 GB |
| Optimizer (Adam), total | 7B × 8 bytes | 56 GB |
| • m (momentum) | 7B × 2 bytes | 14 GB |
| • v (variance) | 7B × 2 bytes | 14 GB |
| • Master weights (FP32) | 7B × 4 bytes | 28 GB |
| Total for parameter state | 14 + 14 + 56 | ~84 GB |

Activations are batch-dependent and add another 10-50+ GB, so fully training a 7B model needs roughly 95-135 GB of VRAM. That already exceeds a single A100 80GB, so it takes multiple GPUs or memory-saving techniques such as gradient checkpointing or LoRA.

INFERENCE MEMORY, 7B parameter model

| Precision | Calculation | Weights |
|---|---|---|
| FP16 | 7B × 2 bytes | 14 GB |
| INT8 | 7B × 1 byte | 7 GB |
| INT4 | 7B × 0.5 bytes | 3.5 GB |

Plus KV cache (grows with context):
• Per token: ~0.5-2 MB depending on model
• 8K context: 4-16 GB additional

Totals with 8K context:
• FP16 7B: ~18-30 GB
• INT4 7B: ~8-20 GB (quantizing the weights does not shrink the KV cache)

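The arithmetic above can be reproduced with a few helper functions. This is a minimal sketch, not a profiler: the function names are hypothetical, gigabytes are decimal (10^9 bytes) to match the figures above, and the KV cache formula assumes plain multi-head attention with an illustrative shape of 32 layers and hidden size 4096. Models with grouped-query attention need less cache per token.

```python
GB = 1e9  # decimal gigabytes, matching "7B x 2 bytes = 14 GB"

# Bytes per parameter for each storage precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(n_params: float, precision: str) -> float:
    """Memory for the model weights alone."""
    return n_params * BYTES_PER_PARAM[precision] / GB


def training_state_gb(n_params: float) -> float:
    """Weights + gradients + Adam state, before activations.

    Per-parameter bytes follow the breakdown above:
    2 (FP16 weights) + 2 (FP16 gradients) + 8 (Adam: m, v, FP32 master copy).
    """
    return n_params * (2 + 2 + 8) / GB


def kv_cache_bytes(n_layers: int, hidden_size: int, n_tokens: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size, assuming one key and one value vector of
    hidden_size per layer per token (plain multi-head attention)."""
    return 2 * n_layers * hidden_size * bytes_per_value * n_tokens


n = 7e9
for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {weights_gb(n, precision):.1f} GB of weights")
print(f"training state: {training_state_gb(n):.0f} GB before activations")

# Assumed model shape (illustrative): 32 layers, hidden size 4096, FP16 cache
per_token = kv_cache_bytes(32, 4096, 1)
print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"KV cache at 8K context: {kv_cache_bytes(32, 4096, 8192) / 2**30:.1f} GiB")
```

With the assumed shape, the per-token cache lands at the low end of the 0.5-2 MB range quoted above. Larger hidden sizes or more layers push it toward the high end.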
Precision Formats
Different number formats trade accuracy for memory/speed.
FP32 (32-bit float):
• 4 bytes per parameter
• Full precision, baseline quality
• Slowest, most memory

FP16 (16-bit float):
• 2 bytes per parameter
• Slight precision loss
• Up to ~2x faster than FP32, half the memory
• Can overflow (limited range; the largest finite value is 65,504)

BF16 (bfloat16):
• 2 bytes per parameter
• Same range as FP32, less precision
• Better for training than FP16
• Supported on newer GPUs (A100 and later)

INT8 (8-bit integer):
• 1 byte per parameter
• Quantized (needs calibration)
• 4x memory savings vs FP32
• ~1% quality loss typically

INT4 (4-bit integer):
• 0.5 bytes per parameter
• Aggressive quantization
• 8x memory savings vs FP32
• 1-3% quality loss typically

MEMORY PER 1B PARAMETERS

| Format | Memory |
|---|---|
| FP32 | 4.0 GB |
| FP16 | 2.0 GB |
| BF16 | 2.0 GB |
| INT8 | 1.0 GB |
| INT4 | 0.5 GB |

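The FP16 versus BF16 tradeoff (limited range versus reduced precision) can be seen without a GPU. The sketch below uses Python's `struct` module, whose `e` format is IEEE half precision. The helper names are hypothetical, and the BF16 helper is a simulation that truncates a float32 to its top 16 bits, whereas real hardware typically rounds to nearest.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision.

    struct raises OverflowError when the value is too large for FP16;
    that is reported here as infinity.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")


def to_bf16(x: float) -> float:
    """Simulate bfloat16 by keeping the top 16 bits of the float32 encoding
    (sign, 8 exponent bits, 7 mantissa bits). Truncation, not rounding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (0.1, 70000.0):
    print(f"{value}: fp16={to_fp16(value)}, bf16={to_bf16(value)}")
```

A large value such as 70000.0 overflows in FP16 but stays finite in BF16, while a small value such as 0.1 is represented more accurately in FP16. That is the usual reason BF16 is preferred for training, where large intermediate values are common.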
Quantization
Quantization converts weights from a higher-precision format to a lower-precision one.
Original weight (FP16): 0.0234375

Quantize to INT8:
1. Find range: [min_weight, max_weight] = [-1.0, 1.0]
2. Map to INT8 range: [-128, 127]
3. scale = (max - min) / 255 = 0.00784
4. quantized = round(weight / scale) = round(2.99) = 3

Dequantize:
weight ≈ 3 × 0.00784 = 0.02352 (close to the original!)

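The worked example above translates directly into code. This is a minimal sketch with hypothetical function names. It follows the example in omitting a zero point, which is only safe because the range [-1.0, 1.0] is symmetric around zero; an asymmetric range would need one.

```python
def int8_scale(w_min: float, w_max: float) -> float:
    """Step 3: width of one quantization step across 255 intervals."""
    return (w_max - w_min) / 255


def quantize(weight: float, scale: float) -> int:
    """Step 4: round to the nearest step, then clamp to the INT8 range."""
    q = round(weight / scale)
    return max(-128, min(127, q))


def dequantize(q: int, scale: float) -> float:
    """Recover an approximation of the original weight."""
    return q * scale


scale = int8_scale(-1.0, 1.0)
q = quantize(0.0234375, scale)
restored = dequantize(q, scale)
print(f"scale={scale:.5f}, quantized={q}, restored={restored:.5f}")
print(f"absolute error: {abs(restored - 0.0234375):.6f}")
```

For values inside the range, the rounding error is at most half a step (scale / 2), which is why a narrower weight range gives a more accurate quantization.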
Quantization Methods
POST-TRAINING QUANTIZATION (PTQ)
1. Train the model normally (FP16/FP32)
2. After training, quantize the weights
3. Calibrate with sample data

Pros: Simple, no retraining
Cons: May lose quality under aggressive quantization
Used by: GPTQ, AWQ, most deployment tools

QUANTIZATION-AWARE TRAINING (QAT)
1. Train with simulated quantization
2. The model learns to be robust to quantization noise
3. The final weights naturally quantize well

Pros: Better quality at low precision
Cons: Requires training, more complex
Used when: PTQ quality is insufficient

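Both methods rest on the same two operations: choosing a scale from observed values (calibration) and rounding values onto the integer grid. The sketch below is illustrative only, with hypothetical names. It swaps in symmetric absmax scaling (scale = max |x| / 127), a common simple choice, rather than the procedure of any specific tool such as GPTQ or AWQ. The `fake_quantize` function shows the "simulated quantization" idea from QAT; a real QAT setup also needs a way to pass gradients through the rounding step, which is not shown here.

```python
def calibrate_scale(samples: list[float]) -> float:
    """PTQ calibration: pick a scale so the largest observed magnitude
    maps to the edge of the symmetric INT8 range [-127, 127]."""
    return max(abs(x) for x in samples) / 127


def quantize(x: float, scale: float) -> int:
    """Round onto the integer grid and clamp to [-127, 127]."""
    return max(-127, min(127, round(x / scale)))


def fake_quantize(x: float, scale: float) -> float:
    """QAT-style simulated quantization: quantize, then immediately
    dequantize, so the value stays a float but carries rounding noise."""
    return quantize(x, scale) * scale


# Hypothetical trained weights, used here as the calibration data
weights = [-0.8, -0.1, 0.0234375, 0.4, 0.9]
scale = calibrate_scale(weights)
quantized = [quantize(w, scale) for w in weights]
noisy = [fake_quantize(w, scale) for w in weights]
worst = max(abs(w - n) for w, n in zip(weights, noisy))
print(f"scale={scale:.5f}, quantized={quantized}")
print(f"worst-case error: {worst:.6f} (bound: {scale / 2:.6f})")
```

In PTQ, only the integers in `quantized` and the scale are stored. In QAT, the forward pass would use `noisy` in place of the original weights during training so the model adapts to the rounding error.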
Common Quantization Formats
| Format | Description | Quality | Use case |
|---|---|---|---|
| GPTQ | 4-bit, row-wise | Good | GPU inference |
| AWQ | 4-bit, activation-aware | Better | GPU inference |
| GGUF | Various bits, CPU-friendly | Good | CPU/Mac inference |
| bitsandbytes | 4/8-bit, dynamic | Good | Training + inference |
Practical GPU Selection
INFERENCE (running models):

| Model | Precision | VRAM needed | Example GPUs |
|---|---|---|---|
| 7B | FP16 | 16+ GB | RTX 4090, A10, L4 |
| 7B | INT4 | 8+ GB | RTX 3090, 4080 |
| 13B | FP16 | 30+ GB | A100 40GB, A6000 |
| 13B | INT4 | 12+ GB | RTX 4090 |
| 70B | FP16 | 140+ GB | 2x A100 80GB |
| 70B | INT4 | 40+ GB | 2x RTX 4090, A100 80GB (a single A100 40GB leaves almost no headroom for the KV cache) |

TRAINING/FINE-TUNING:

| Task | VRAM needed | Example GPUs |
|---|---|---|
| 7B full fine-tune | 80+ GB | A100 80GB or multi-GPU |
| 7B LoRA fine-tune | 20+ GB | RTX 4090, A10 |
| 13B LoRA | 30+ GB | A100 40GB, A6000 |

Common GPUs
| GPU | VRAM | Good for |
|---|---|---|
| RTX 3090 | 24 GB | Dev, INT4 inference |
| RTX 4090 | 24 GB | Dev, LoRA fine-tuning |
| A10 | 24 GB | Cloud inference |
| L4 | 24 GB | Cloud inference (efficient) |
| A100 40GB | 40 GB | Training, large inference |
| A100 80GB | 80 GB | Large model training |
| H100 80GB | 80 GB | Fastest training |
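
Matching a memory requirement against this table is a simple filter. The sketch below is a hypothetical helper, not a sizing tool: the VRAM figures are copied from the table above, and the optional `headroom` fraction is an assumed safety margin for activations and framework overhead.

```python
# VRAM in GB, copied from the table above
GPU_VRAM_GB = {
    "RTX 3090": 24,
    "RTX 4090": 24,
    "A10": 24,
    "L4": 24,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
}


def gpus_that_fit(required_gb: float, headroom: float = 0.0) -> list[str]:
    """Return single GPUs whose VRAM covers the requirement.

    headroom is a fraction (0.1 means 10% extra) added on top of
    required_gb before comparing.
    """
    needed = required_gb * (1 + headroom)
    return [name for name, vram in GPU_VRAM_GB.items() if vram >= needed]


print("13B FP16 (30 GB):", gpus_that_fit(30))
print("70B FP16 (140 GB):", gpus_that_fit(140))
```

An empty result means no single card in the table is large enough, so the model has to be split across several GPUs or quantized further.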
Debugging Memory Issues
OUT OF MEMORY (OOM)

Symptoms:
• CUDA out of memory error
• Training crashes

Immediate fixes:
1. Reduce batch size
2. Use gradient accumulation
3. Enable gradient checkpointing
4. Use mixed precision (FP16/BF16)

Longer-term fixes:
1. Use quantization (INT8/INT4)
2. Use LoRA instead of full fine-tuning
3. Get more VRAM

MEMORY KEEPS GROWING

Symptoms:
• Memory usage increases over time
• Eventually OOM

Causes:
• Not clearing cache
• Storing too much history (for example, accumulating loss tensors instead of plain numbers)
• KV cache not managed

Debug steps:
1. Call torch.cuda.empty_cache() between batches (this releases cached blocks only; tensors that are still referenced stay allocated)
2. del intermediate tensors
3. For inference: limit context length or use a sliding window

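Gradient accumulation, one of the immediate OOM fixes listed above, works because averaging the gradients of several small batches gives the same result as one large batch, while only one small batch has to be in memory at a time. The sketch below shows this in plain Python on a toy one-parameter model (y = w * x with mean squared error). The function names are hypothetical and the batch size is assumed to divide evenly into micro-batches.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float],
                     micro_batch_size: int) -> float:
    """Average gradients over equal-sized micro-batches.

    Assumes len(xs) is a multiple of micro_batch_size.
    """
    n_steps = len(xs) // micro_batch_size
    total = 0.0
    for step in range(n_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        # Dividing by n_steps mirrors scaling each micro-batch loss
        # by 1 / accumulation_steps before the backward pass.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

full = grad_mse(0.5, xs, ys)                                 # one batch of 8
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)    # four batches of 2
print(f"full-batch gradient:  {full}")
print(f"accumulated gradient: {accum}")
```

The two values agree up to floating-point rounding, so the optimizer step is the same. The cost is more forward and backward passes per update, which trades speed for lower peak memory.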
When This Matters
| Situation | What to know |
|---|---|
| Choosing GPU for inference | Parameters × bytes per parameter (+ KV cache) = VRAM needed |
| Running 7B locally | INT4 quantization, ~8 GB needed |
| Training models | Need ~6x the FP16 weight size for weights, gradients, and Adam state, plus activations |
| Fine-tuning on consumer GPU | Use LoRA + INT8/INT4 |
| Getting OOM errors | Reduce batch, use gradient accumulation |
| Understanding model cards | Check precision (FP16, INT4, etc.) |
| Cost optimization | Smaller precision = cheaper inference |
| Understanding quantization | INT4 ~ 1-3% quality loss, 8x savings |
Production signal
Why this concept matters
| Area | Why it matters |
|---|---|
| Interview | 65% of ML infrastructure interviews |
| Production | Model deployment and cost |
| Performance | INT4 = 8x memory savings vs FP32 |