Memory & Compute | Concepts

TL;DR

GPU memory (VRAM) limits which models you can run. The weights of a 7B model take ~14 GB in FP16 or ~3.5 GB in INT4, before activations and KV cache. Quantization trades a small quality loss for large memory savings. Understanding these tradeoffs is essential for deploying LLMs.

Visual Overview

GPU vs CPU for ML

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   CPU: 8-64 powerful cores                                │
│        Good at complex, sequential tasks                  │
│        Each core handles different work                   │
│                                                           │
│   GPU: 1000-10000+ simple cores                           │
│        Good at simple, parallel tasks                     │
│        All cores do the same operation on different data  │
│                                                           │
│   Matrix multiplication (core ML operation):              │
│     CPU: Compute elements one by one (or a few at a time) │
│     GPU: Compute thousands of elements simultaneously     │
│                                                           │
│   Result: GPUs are 10-100x faster for ML workloads.       │
│                                                           │
└───────────────────────────────────────────────────────────┘

Key GPU specs:

Spec	What it means	Why it matters
CUDA cores	Number of parallel processors	More = faster
VRAM	Video memory	Limits model size
Memory bandwidth	Data transfer speed	Limits throughput
Tensor cores	Specialized matrix units	2-4x faster for ML

VRAM (Video RAM)

GPU memory is separate from system RAM. Models must fit in VRAM.

VRAM Usage

WHAT USES VRAM
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Training:                                               │
│     VRAM = Model + Activations + Gradients + Optimizer    │
│                                                           │
│   Inference:                                              │
│     VRAM = Model + Activations (+ KV cache for LLMs)      │
│                                                           │
└───────────────────────────────────────────────────────────┘

TRAINING MEMORY (FP16)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ 7B parameter model:                                       │
│                                                           │
│ Model weights: 7B × 2 bytes = 14 GB                       │
│ Gradients: 7B × 2 bytes = 14 GB                           │
│ Optimizer (Adam): 7B × 8 bytes = 56 GB                    │
│ • m (momentum): 7B × 2 bytes                              │
│ • v (variance): 7B × 2 bytes                              │
│ • Master weights FP32: 7B × 4 bytes                       │
│                                                           │
│ Total just for params: ~84 GB                             │
│                                                           │
│ Activations (batch-dependent): Additional 10-50+ GB       │
│                                                           │
│ Training 7B model needs: ~95-135+ GB VRAM                 │
│ → Requires multi-GPU (exceeds a single A100 80GB)         │
│                                                           │
└───────────────────────────────────────────────────────────┘
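
The breakdown above can be reproduced with a few lines of arithmetic. This is a minimal sketch that only uses the per-parameter byte counts listed in the box; the function name is made up for illustration, and activations are left out because they depend on batch size and sequence length.

```python
def training_param_memory_gb(params_billions):
    """Parameter-related training memory in GB (1 GB = 1e9 bytes).

    Byte counts per parameter are the ones from the breakdown above.
    """
    bytes_per_param = {
        "weights_fp16": 2,
        "gradients_fp16": 2,
        "adam_m": 2,
        "adam_v": 2,
        "master_weights_fp32": 4,
    }
    # 1e9 parameters x N bytes each = N GB
    breakdown = {name: params_billions * b for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown

for name, gb in training_param_memory_gb(7).items():
    print(f"{name:>20}: {gb:5.1f} GB")
```

Many mixed-precision setups keep the Adam states in FP32 instead, which would raise the total; change the byte counts to match your setup.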

INFERENCE MEMORY
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ 7B parameter model:                                       │
│                                                           │
│ FP16: 7B × 2 bytes = 14 GB                                │
│ INT8: 7B × 1 byte = 7 GB                                  │
│ INT4: 7B × 0.5 bytes = 3.5 GB                             │
│                                                           │
│ Plus KV cache (grows with context):                       │
│ Per token: ~0.5-2 MB depending on model                   │
│ 8K context: 4-16 GB additional                            │
│                                                           │
│ FP16 7B with 8K context: ~20-30 GB                        │
│ INT4 7B with 8K context: ~8-20 GB                         │
│                                                           │
└───────────────────────────────────────────────────────────┘
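
The inference estimate is weights plus KV cache. The sketch below shows one way to compute both. The layer count, head count and head size are assumptions for a typical 7B-class model, not values from this page; with them the cache works out to about 0.5 MB per token, the low end of the range above.

```python
def weights_gb(params_billions, bits_per_param):
    """Weight memory in GB (1 GB = 1e9 bytes): params x bits / 8."""
    return params_billions * bits_per_param / 8

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len,
                bytes_per_value=2, batch_size=1):
    """KV cache size in GB: one key and one value vector per layer per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch_size * context_len * per_token_bytes / 1e9

# Hypothetical 7B-class shape (an assumption, not taken from the text above).
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)
cache = kv_cache_gb(context_len=8192, **shape)

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    w = weights_gb(7, bits)
    print(f"{name}: weights {w:.1f} GB + KV cache {cache:.1f} GB = {w + cache:.1f} GB")
```

The cache here is kept in FP16 (bytes_per_value=2) even when the weights are quantized, which is why INT4 totals shrink less than the weight numbers alone suggest.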

Precision Formats

Different number formats trade accuracy for memory/speed.

Precision Formats

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   FP32 (32-bit float):                                    │
│     • 4 bytes per parameter                               │
│     • Full precision, baseline quality                    │
│     • Slowest, most memory                                │
│                                                           │
│   FP16 (16-bit float):                                    │
│     • 2 bytes per parameter                               │
│     • Slight precision loss                               │
│     • 2x faster, half memory                              │
│     • Can overflow (limited range)                        │
│                                                           │
│   BF16 (bfloat16):                                        │
│     • 2 bytes per parameter                               │
│     • Same range as FP32, less precision                  │
│     • Better for training than FP16                       │
│     • Supported on newer GPUs (Ampere and later)          │
│                                                           │
│   INT8 (8-bit integer):                                   │
│     • 1 byte per parameter                                │
│     • Quantized (needs calibration)                       │
│     • 4x memory savings vs FP32                           │
│     • ~1% quality loss typically                          │
│                                                           │
│   INT4 (4-bit integer):                                   │
│     • 0.5 bytes per parameter                             │
│     • Aggressive quantization                             │
│     • 8x memory savings vs FP32                           │
│     • 1-3% quality loss typically                         │
│                                                           │
└───────────────────────────────────────────────────────────┘
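
The FP16 range limit and the BF16 precision loss can be seen without a GPU. This sketch uses only Python's standard struct module. The helper names are made up, and BF16 is simulated by truncating an FP32 value to its top 16 bits, which is an approximation (real hardware usually rounds to nearest).

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Simulate bfloat16: keep the sign, exponent and top 7 mantissa bits of FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def fp16_overflows(x):
    """True if x is too large for FP16. struct raises here; GPUs would produce inf."""
    try:
        to_fp16(x)
        return False
    except (OverflowError, struct.error):
        return True

for value in (0.1, 1000.0, 70000.0):
    fp16 = "overflow" if fp16_overflows(value) else to_fp16(value)
    print(f"{value}: FP16 -> {fp16}, BF16 -> {to_bf16(value)}")
```

Small values come back closer to the original in FP16, while large values survive only in BF16. That is the range-versus-precision tradeoff described in the box.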

MEMORY PER 1B PARAMETERS
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ FP32 ════════════════════════════════ 4.0 GB              │
│ FP16 ════════════════════ 2.0 GB                          │
│ BF16 ════════════════════ 2.0 GB                          │
│ INT8 ════════════ 1.0 GB                                  │
│ INT4 ══════ 0.5 GB                                        │
│                                                           │
└───────────────────────────────────────────────────────────┘

Quantization

Converting weights from high precision to lower precision.

Quantization Basics

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Original weight (FP16): 0.0234375                       │
│                                                           │
│   Quantize to INT8:                                       │
│     1. Find range: [min_weight, max_weight] = [-1.0, 1.0] │
│     2. Map to INT8 range: [-128, 127]                     │
│     3. scale = (max - min) / 255 = 0.00784                │
│     4. quantized = round(weight / scale) = round(2.99) = 3│
│                                                           │
│   Dequantize:                                             │
│     weight ≈ 3 × 0.00784 = 0.02352  (close to original!)  │
│                                                           │
└───────────────────────────────────────────────────────────┘
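
The four steps above translate directly into code. This is a minimal sketch of the same mapping, using the example range [-1.0, 1.0] and no zero point; the function names are illustrative only.

```python
def quantize_int8(weight, w_min=-1.0, w_max=1.0):
    """Map a float weight to an INT8 code using the steps above."""
    scale = (w_max - w_min) / 255        # step 3: size of one integer step
    q = round(weight / scale)            # step 4: nearest integer code
    q = max(-128, min(127, q))           # keep the code inside the INT8 range
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float weight from its integer code."""
    return q * scale

q, scale = quantize_int8(0.0234375)
restored = dequantize(q, scale)
print(f"code={q}, scale={scale:.5f}, restored={restored:.5f}")
print(f"absolute error: {abs(restored - 0.0234375):.7f}")
```

The rounding error is at most half a step (scale / 2) for any weight inside the range, so a narrower range gives finer resolution.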

Quantization Methods

POST-TRAINING QUANTIZATION (PTQ)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   1. Train model normally (FP16/FP32)                     │
│   2. After training, quantize weights                     │
│   3. Calibrate with sample data                           │
│                                                           │
│   Pros: Simple, no retraining                             │
│   Cons: May lose quality for aggressive quantization      │
│                                                           │
│   Used by: GPTQ, AWQ, most deployment tools               │
│                                                           │
└───────────────────────────────────────────────────────────┘
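
As a rough illustration of step 2, the sketch below quantizes already-trained weights to 4 bits in small groups, each with its own scale. It is plain round-to-nearest with an absmax scale, not GPTQ or AWQ. The calibration step is omitted, and the weight values and function names are invented for the example.

```python
def quantize_group_int4(weights):
    """Symmetric absmax quantization of one group to signed 4-bit codes."""
    absmax = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = absmax / 7                             # largest weight maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def ptq_int4(weights, group_size=4):
    """Quantize a flat weight list group by group; returns (codes, scale) pairs."""
    return [quantize_group_int4(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]

def dequantize_groups(groups):
    """Rebuild approximate float weights from the quantized groups."""
    return [code * scale for codes, scale in groups for code in codes]

# Made-up "trained" weights for illustration.
trained = [0.02, -0.15, 0.31, 0.07, -1.20, 0.85, 0.40, -0.03]
groups = ptq_int4(trained)
restored = dequantize_groups(groups)
worst = max(abs(a - b) for a, b in zip(trained, restored))
print(f"worst-case error: {worst:.4f}")
```

Per-group scales keep one large weight from stretching the grid for every other weight, which is one reason 4-bit formats store a scale per small group.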

QUANTIZATION-AWARE TRAINING (QAT)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ 1. Train with simulated quantization                      │
│ 2. Model learns to be robust to quantization noise        │
│ 3. Final weights naturally quantize well                  │
│                                                           │
│ Pros: Better quality at low precision                     │
│ Cons: Requires training, more complex                     │
│                                                           │
│ Used when: PTQ quality is insufficient                    │
│                                                           │
└───────────────────────────────────────────────────────────┘
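
A toy version of step 1 is sketched below, with one weight and one data point. The forward pass uses the quantized weight. The backward pass uses a straight-through estimator (gradient passed through the rounding as if it were the identity), which is a common choice but an assumption here, since the text above does not name a method. The grid step, learning rate and data are arbitrary.

```python
def fake_quant(w, scale=0.25, qmin=-8, qmax=7):
    """Simulated quantization: snap w to the nearest point on a 4-bit grid."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale

w, lr = 0.0, 0.1          # full-precision weight being trained
x, y = 1.0, 0.5           # single training example: want w_q * x == y

for step in range(50):
    w_q = fake_quant(w)            # forward pass sees the quantized weight
    error = w_q * x - y
    grad = 2 * error * x           # straight-through: treat d(w_q)/dw as 1
    w -= lr * grad                 # update the full-precision copy

print(f"trained w={w:.3f}, quantized w={fake_quant(w):.3f}")
```

The full-precision weight keeps moving until its quantized value fits the data, so the final weight sits on a grid point that works rather than being rounded after the fact.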

Common Quantization Formats

Format	Description	Quality	Use case
GPTQ	4-bit, row-wise	Good	GPU inference
AWQ	4-bit, activation-aware	Better	GPU inference
GGUF	Various bits, CPU-friendly	Good	CPU/Mac inference
bitsandbytes	4/8-bit, dynamic	Good	Training + inference

Practical GPU Selection

GPU Selection Guide

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   INFERENCE (running models):                             │
│                                                           │
│   7B model:                                               │
│     FP16: Need 16+ GB → RTX 4090, A10, L4                 │
│     INT4: Need 8+ GB  → RTX 3090, 4080                    │
│                                                           │
│   13B model:                                              │
│     FP16: Need 30+ GB → A100 40GB, A6000                  │
│     INT4: Need 12+ GB → RTX 4090                          │
│                                                           │
│   70B model:                                              │
│     FP16: Need 140+ GB → 2x A100 80GB                     │
│     INT4: Need 40+ GB  → A100 40GB, 2x RTX 4090           │
│                                                           │
│                                                           │
│   TRAINING/FINE-TUNING:                                   │
│                                                           │
│   7B full fine-tune: 90+ GB → 2x A100 80GB                │
│   7B LoRA fine-tune: 20+ GB → RTX 4090, A10               │
│   13B LoRA: 30+ GB → A100 40GB, A6000                     │
│                                                           │
└───────────────────────────────────────────────────────────┘

Common GPUs

GPU	VRAM	Good for
RTX 3090	24 GB	Dev, INT4 inference
RTX 4090	24 GB	Dev, LoRA fine-tuning
A10	24 GB	Cloud inference
L4	24 GB	Cloud inference (efficient)
A100 40GB	40 GB	Training, large inference
A100 80GB	80 GB	Large model training
H100 80GB	80 GB	Fastest training
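
A quick way to use the table above is to filter it by the VRAM you estimated. This is a sketch: the dictionary copies the table, the function name is made up, and splitting a model across GPUs is treated as free, which ignores real overhead.

```python
# VRAM in GB, copied from the table above.
GPUS_GB = {
    "RTX 3090": 24, "RTX 4090": 24, "A10": 24, "L4": 24,
    "A100 40GB": 40, "A100 80GB": 80, "H100 80GB": 80,
}

def gpus_that_fit(required_gb, num_gpus=1):
    """Return the GPUs whose combined VRAM covers the requirement."""
    return [name for name, vram in GPUS_GB.items()
            if vram * num_gpus >= required_gb]

print("13B FP16 (30+ GB):", gpus_that_fit(30))
print("70B FP16 (140+ GB), 2 GPUs:", gpus_that_fit(140, num_gpus=2))
```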

Debugging Memory Issues

OUT OF MEMORY (OOM)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Symptoms:                                               │
│     • CUDA out of memory error                            │
│     • Training crashes                                    │
│                                                           │
│   Immediate fixes:                                        │
│     1. Reduce batch size                                  │
│     2. Use gradient accumulation                          │
│     3. Enable gradient checkpointing                      │
│     4. Use mixed precision (FP16/BF16)                    │
│                                                           │
│   Longer-term fixes:                                      │
│     1. Use quantization (INT8/INT4)                       │
│     2. Use LoRA instead of full fine-tuning               │
│     3. Get more VRAM                                      │
│                                                           │
└───────────────────────────────────────────────────────────┘
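
Fixes 1 and 2 work together: a smaller batch uses less memory, and gradient accumulation recovers the gradient of the larger batch. The sketch below checks that on a one-weight linear model in plain Python. The data and names are invented; in a real framework you would scale each micro-batch loss and step the optimizer once after the last micro-batch.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]   # made-up (x, y) pairs
w = 0.5

# Gradient from the full batch of 4 (needs memory for all 4 samples at once).
full = grad_mse(w, data)

# Same gradient from 2 micro-batches of 2, processed one at a time.
micro_batch_size = 2
micro_batches = [data[i:i + micro_batch_size]
                 for i in range(0, len(data), micro_batch_size)]
accumulated = 0.0
for mb in micro_batches:
    accumulated += grad_mse(w, mb) / len(micro_batches)   # average across micro-batches

print(f"full batch: {full:.4f}, accumulated: {accumulated:.4f}")
```

The two values match when the micro-batches are the same size, so the only cost is more forward and backward passes per optimizer step.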

MEMORY KEEPS GROWING
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Symptoms:                                                 │
│ • Memory usage increases over time                        │
│ • Eventually OOM                                          │
│                                                           │
│ Causes:                                                   │
│ • Not clearing cache                                      │
│ • Storing too much history (e.g. summing loss tensors)    │
│ • KV cache not managed                                    │
│                                                           │
│ Debug steps:                                              │
│ 1. torch.cuda.empty_cache() between batches               │
│ 2. del intermediate tensors                               │
│ 3. For inference: limit context length or use             │
│ sliding window                                            │
│                                                           │
└───────────────────────────────────────────────────────────┘
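
Debug step 3 can be pictured as a cache with a fixed capacity. The sketch below is a toy sliding window built on collections.deque. The class name is hypothetical and the entries are strings standing in for per-token key/value tensors; real inference engines manage this internally.

```python
from collections import deque

class SlidingWindowCache:
    """Keeps only the most recent max_tokens entries; older ones are dropped."""

    def __init__(self, max_tokens):
        self.entries = deque(maxlen=max_tokens)

    def append(self, key, value):
        # When full, deque discards the oldest entry automatically.
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)

cache = SlidingWindowCache(max_tokens=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")

print(len(cache), "entries kept, oldest is", cache.entries[0])
```

Memory stays constant no matter how long generation runs, at the cost of the model no longer seeing tokens that fell out of the window.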

When This Matters

Situation	What to know
Choosing GPU for inference	Parameters × bytes per parameter (+ KV cache) = VRAM needed
Running 7B locally	INT4 quantization, ~8GB needed
Training models	Need 4-6x model size in VRAM
Fine-tuning on consumer GPU	Use LoRA + INT8/INT4
Getting OOM errors	Reduce batch, use gradient accumulation
Understanding model cards	Check precision (FP16, INT4, etc.)
Cost optimization	Smaller precision = cheaper inference
Understanding quantization	INT4: ~1-3% quality loss, 8x memory savings vs FP32
