Memory & Compute

Summary

GPU memory, precision formats, quantization (INT4/INT8), and practical GPU selection for LLMs

TL;DR

GPU memory (VRAM) limits which models you can run. A 7B model needs ~14 GB for its weights in FP16 or ~3.5 GB in INT4. Quantization trades a small quality loss for large memory savings. Understanding these tradeoffs is essential for deploying LLMs.

Visual Overview

GPU vs CPU for ML

                                                           
   CPU: 8-64 powerful cores                                
        Good at complex, sequential tasks                  
        Each core handles different work                   
                                                           
   GPU: 1000-10000+ simple cores                           
        Good at simple, parallel tasks                     
        All cores do same operation on different data      
                                                           
   Matrix multiplication (core ML operation):              
     CPU: Compute elements one by one (or few at a time)   
     GPU: Compute thousands of elements simultaneously     
                                                           
   Result: GPUs are typically 10-100x faster for matrix-heavy ML workloads.
                                                           

Key GPU specs:

Spec              What it means                  Why it matters
CUDA cores        Number of parallel processors  More = faster
VRAM              Video memory                   Limits model size
Memory bandwidth  Data transfer speed            Limits throughput
Tensor cores      Specialized matrix units       2-4x faster for ML

VRAM (Video RAM)

GPU memory is separate from system RAM. Models must fit in VRAM.

VRAM Usage
WHAT USES VRAM

                                                           
   Training:                                               
     VRAM = Model + Activations + Gradients + Optimizer    
                                                           
   Inference:                                              
     VRAM = Model + Activations (+ KV cache for LLMs)      
                                                           


TRAINING MEMORY (FP16)

 
 7B parameter model: 
 
 Model weights: 7B × 2 bytes = 14 GB 
 Gradients: 7B × 2 bytes = 14 GB 
 Optimizer (Adam): 7B × 8 bytes = 56 GB 
 • m (momentum): 7B × 2 bytes 
 • v (variance): 7B × 2 bytes 
 • Master weights FP32: 7B × 4 bytes 
 
 Total just for params: ~84 GB 
 
 Activations (batch-dependent): Additional 10-50+ GB 
 
 Training a 7B model needs: ~95-135 GB VRAM (84 GB + activations) 
  → Requires multi-GPU, or one A100 80GB with memory-saving techniques (e.g. 8-bit optimizer, offloading) 
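
The per-parameter arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses the same byte counts as the breakdown (FP16 weights and gradients, 8 bytes of Adam state per parameter) and ignores activations; the function name `training_param_memory_gb` is made up for illustration.

```python
def training_param_memory_gb(
    n_params_billion: float,
    weight_bytes: int = 2,     # FP16 weights
    grad_bytes: int = 2,       # FP16 gradients
    optimizer_bytes: int = 8,  # Adam: m (2) + v (2) + FP32 master weights (4)
) -> dict:
    """Parameter-related training memory in GB (1 GB = 1e9 bytes).

    Billions of parameters x bytes per parameter gives GB directly.
    Activations are batch-dependent and are NOT included.
    """
    weights = n_params_billion * weight_bytes
    gradients = n_params_billion * grad_bytes
    optimizer = n_params_billion * optimizer_bytes
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


breakdown = training_param_memory_gb(7)
for name, gb in breakdown.items():
    print(f"{name:>10}: {gb:.0f} GB")  # total comes to 84 GB for a 7B model
```

Changing `optimizer_bytes` is a quick way to see why optimizer state usually dominates: it is four times the size of the FP16 weights in this accounting.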
 


INFERENCE MEMORY

 
 7B parameter model: 
 
 FP16: 7B × 2 bytes = 14 GB 
 INT8: 7B × 1 byte = 7 GB 
 INT4: 7B × 0.5 bytes = 3.5 GB 
 
 Plus KV cache (grows with context): 
 Per token: ~0.5-2 MB depending on model 
 8K context: 4-16 GB additional 
 
 FP16 7B with 8K context: ~18-30 GB 
 INT4 7B with 8K context: ~8-20 GB (with an FP16 KV cache) 
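
To see where the per-token KV cache figure comes from, here is a rough estimator. It is a sketch under stated assumptions, not a measurement: the defaults (32 layers, hidden size 4096, standard multi-head attention, FP16 cache, batch size 1) are assumed values in the style of a typical 7B model, and models that use grouped-query attention cache far less per token.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(n_params_billion: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * BYTES_PER_PARAM[precision]


def kv_cache_gb(
    context_tokens: int,
    n_layers: int = 32,      # assumed, typical for a 7B model
    hidden_size: int = 4096,  # assumed, typical for a 7B model
    cache_bytes: int = 2,     # FP16 cache entries
) -> float:
    """KV cache for one sequence: K and V per layer, hidden_size values each."""
    per_token_bytes = 2 * n_layers * hidden_size * cache_bytes  # 524,288 B with defaults
    return context_tokens * per_token_bytes / 1e9


def inference_gb(n_params_billion: float, precision: str, context_tokens: int) -> float:
    """Weights plus KV cache; activations and framework overhead are ignored."""
    return weights_gb(n_params_billion, precision) + kv_cache_gb(context_tokens)


for precision in ("fp16", "int8", "int4"):
    print(f"7B {precision} @ 8K context: {inference_gb(7, precision, 8192):.1f} GB")
```

With these defaults the cache costs about 0.5 MB per token, which is the low end of the range quoted above, and an 8K context adds roughly 4.3 GB regardless of how aggressively the weights are quantized.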
 


Precision Formats

Different number formats trade accuracy for memory/speed.

                                                           
   FP32 (32-bit float):                                    
     • 4 bytes per parameter                               
     • Full precision, baseline quality                    
     • Slowest, most memory
                                                           
   FP16 (16-bit float):                                    
     • 2 bytes per parameter                               
     • Slight precision loss                               
     • 2x faster, half memory
     • Can overflow (limited range)
                                                           
   BF16 (bfloat16):                                        
     • 2 bytes per parameter                               
     • Same range as FP32, less precision                  
     • Better for training than FP16                       
     • Supported on newer GPUs (Ampere generation and later, e.g. A100, RTX 30-series)
                                                           
   INT8 (8-bit integer):                                   
     • 1 byte per parameter                                
     • Quantized (needs calibration)                       
     • 4x memory savings vs FP32
     • ~1% quality loss typically                          
                                                           
   INT4 (4-bit integer):                                   
     • 0.5 bytes per parameter                             
     • Aggressive quantization                             
     • 8x memory savings vs FP32
     • 1-3% quality loss typically                         
                                                           


MEMORY PER 1B PARAMETERS

 
 FP32  4.0 GB 
 FP16  2.0 GB 
 BF16  2.0 GB 
 INT8  1.0 GB 
 INT4  0.5 GB 
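
The FP16 range limit and the BF16 range-versus-precision tradeoff described above can be checked with Python's standard library alone. This is an illustrative sketch: `to_fp16` round-trips through the `struct` module's half-precision format, and `to_bf16` simulates bfloat16 by truncating a float32 to its top 16 bits (real hardware usually rounds rather than truncates). Both helper names are made up for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Simulate bfloat16: keep sign, 8 exponent bits, and the top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Precision: FP16 has more mantissa bits, so it represents 0.1 more closely.
print("fp16(0.1) =", to_fp16(0.1))
print("bf16(0.1) =", to_bf16(0.1))

# Range: FP16 tops out at 65504, while BF16 shares FP32's exponent range.
print("bf16(70000) =", to_bf16(70000.0))
try:
    to_fp16(70000.0)
    fp16_overflowed = False
except (OverflowError, struct.error):
    # Python refuses to pack the value; GPU hardware would produce inf instead.
    fp16_overflowed = True
print("fp16(70000) overflows:", fp16_overflowed)
```

This is the practical reason BF16 is often preferred for training: large intermediate values such as gradients or loss-scaled activations stay finite, at the cost of coarser rounding.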
 


Quantization

Quantization converts weights from a higher-precision format to a lower-precision one.

Quantization Basics

                                                           
   Original weight (FP16): 0.0234375                       
                                                           
   Quantize to INT8:                                       
     1. Find range: [min_weight, max_weight] = [-1.0, 1.0] 
     2. Map to INT8 range: [-128, 127]                     
     3. scale = (max - min) / 255 = 0.00784                
     4. quantized = round(weight / scale) = round(2.99) = 3
                                                           
   Dequantize:                                             
     weight ≈ 3 × 0.00784 = 0.02352  (close to original!)  
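
The same walkthrough as runnable code. This sketch follows the simplified scheme in the diagram (a fixed [-1.0, 1.0] range, 255 steps, no zero-point offset); the function names are illustrative and production libraries typically compute scales per channel or per group.

```python
def quantize(weight: float, w_min: float = -1.0, w_max: float = 1.0) -> tuple:
    """Map a float weight to an INT8 value using one scale for the whole range."""
    scale = (w_max - w_min) / 255       # ~0.00784 for the range [-1, 1]
    q = round(weight / scale)
    q = max(-128, min(127, q))          # clamp to the INT8 range
    return q, scale


def dequantize(q: int, scale: float) -> float:
    """Recover an approximate float weight from the stored integer."""
    return q * scale


original = 0.0234375
q, scale = quantize(original)
restored = dequantize(q, scale)

print(f"scale     = {scale:.5f}")
print(f"quantized = {q}")                        # 3, as in the walkthrough
print(f"restored  = {restored:.5f}")
print(f"error     = {abs(restored - original):.6f}")
```

The rounding error is bounded by half a step (scale / 2), which is why a tighter weight range or a per-group scale improves accuracy.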
                                                           

Quantization Methods

POST-TRAINING QUANTIZATION (PTQ)

                                                           
   1. Train model normally (FP16/FP32)                     
   2. After training, quantize weights                     
   3. Calibrate with sample data                           
                                                           
   Pros: Simple, no retraining                             
   Cons: May lose quality for aggressive quantization      
                                                           
   Used by: GPTQ, AWQ, most deployment tools               
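
To make the "may lose quality" point concrete, the sketch below quantizes the same weights to 8 and 4 bits and compares the round-trip error. It shows only the weight-rounding step with a simple absmax scale; it is not GPTQ or AWQ, it skips calibration entirely, and the sine-based weights are a deterministic stand-in for a real weight row.

```python
import math


def fake_quantize(weights: list, bits: int) -> list:
    """Symmetric absmax quantization to `bits` bits, then dequantize."""
    q_max = 2 ** (bits - 1) - 1                    # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / q_max
    return [round(w / scale) * scale for w in weights]


def rms_error(original: list, restored: list) -> float:
    """Root-mean-square difference between two equal-length lists."""
    squared = sum((a - b) ** 2 for a, b in zip(original, restored))
    return math.sqrt(squared / len(original))


# Deterministic stand-in for one row of model weights (assumption for the demo).
weights = [0.5 * math.sin(0.37 * i) for i in range(256)]

errors = {bits: rms_error(weights, fake_quantize(weights, bits)) for bits in (8, 4)}
for bits, err in errors.items():
    print(f"INT{bits}: RMS round-trip error = {err:.5f}")
```

INT4 has only 15 usable levels in this scheme versus 255 for INT8, so its error is roughly an order of magnitude larger. Methods such as GPTQ and AWQ exist to reduce the impact of that error on model outputs.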
                                                           


QUANTIZATION-AWARE TRAINING (QAT)

 
 1. Train with simulated quantization 
 2. Model learns to be robust to quantization noise 
 3. Final weights naturally quantize well 
 
 Pros: Better quality at low precision 
 Cons: Requires training, more complex 
 
 Used when: PTQ quality is insufficient 
 

Common Quantization Formats

Format        Description                 Quality  Use case
GPTQ          4-bit, row-wise             Good     GPU inference
AWQ           4-bit, activation-aware     Better   GPU inference
GGUF          Various bits, CPU-friendly  Good     CPU/Mac inference
bitsandbytes  4/8-bit, dynamic            Good     Training + inference

Practical GPU Selection

GPU Selection Guide

                                                           
   INFERENCE (running models):                             
                                                           
   7B model:                                               
     FP16: Need 16+ GB → RTX 4090, A10, L4
     INT4: Need 8+ GB  → RTX 3090, 4080
                                                           
   13B model:                                              
     FP16: Need 30+ GB → A100 40GB, A6000
     INT4: Need 12+ GB → RTX 4090
                                                           
   70B model:                                              
     FP16: Need 140+ GB → 2x A100 80GB
     INT4: Need 40+ GB  → A100 40GB, 2x RTX 4090
                                                           
                                                           
   TRAINING/FINE-TUNING:                                   
                                                           
   7B full fine-tune: 80+ GB → A100 80GB
   7B LoRA fine-tune: 20+ GB → RTX 4090, A10
   13B LoRA: 30+ GB → A100 40GB, A6000
                                                           

Common GPUs

GPU        VRAM   Good for
RTX 3090   24 GB  Dev, INT4 inference
RTX 4090   24 GB  Dev, LoRA fine-tuning
A10        24 GB  Cloud inference
L4         24 GB  Cloud inference (efficient)
A100 40GB  40 GB  Training, large inference
A100 80GB  80 GB  Large model training
H100 80GB  80 GB  Fastest training
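
As a rough first filter, the table above can be turned into a lookup that lists which single GPUs can hold a model's weights. This is a sketch only: the VRAM figures are copied from the table, and `overhead_gb` is an assumed flat allowance for activations and a short-context KV cache, not a measured value.

```python
# VRAM in GB, taken from the table above.
GPUS = {
    "RTX 3090": 24,
    "RTX 4090": 24,
    "A10": 24,
    "L4": 24,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
}

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def gpus_that_fit(n_params_billion: float, precision: str, overhead_gb: float = 2.0):
    """Return (GB needed, names of single GPUs with at least that much VRAM)."""
    needed = n_params_billion * BYTES_PER_PARAM[precision] + overhead_gb
    return needed, [name for name, vram in GPUS.items() if vram >= needed]


for size, precision in [(7, "fp16"), (13, "fp16"), (70, "int4"), (70, "fp16")]:
    needed, fits = gpus_that_fit(size, precision)
    label = ", ".join(fits) if fits else "no single GPU in the table (use multi-GPU)"
    print(f"{size}B {precision}: ~{needed:.0f} GB -> {label}")
```

Long contexts or large batches need a bigger `overhead_gb`, so treat a marginal fit (for example 37 GB on a 40 GB card) as a warning rather than a guarantee.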

Debugging Memory Issues

OUT OF MEMORY (OOM)

                                                           
   Symptoms:                                               
     • CUDA out of memory error
     • Training crashes
                                                           
   Immediate fixes:                                        
     1. Reduce batch size                                  
     2. Use gradient accumulation                          
     3. Enable gradient checkpointing                      
     4. Use mixed precision (FP16/BF16)                    
                                                           
   Longer-term fixes:                                      
     1. Use quantization (INT8/INT4)                       
     2. Use LoRA instead of full fine-tuning               
     3. Get more VRAM                                      
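
Reducing the batch size and using gradient accumulation go together: several small micro-batches are processed one at a time, and their gradients are averaged before a single weight update. The toy example below (a one-parameter linear model in plain Python, with made-up data) is a sketch showing that the accumulated gradient matches the full-batch gradient, so only peak memory changes, not the update.

```python
def mse_grad(w: float, batch: list) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy dataset of 8 samples (illustrative values only).
data = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]
w = 0.5

# Option A: one large batch (highest peak memory on a real model).
full_grad = mse_grad(w, data)

# Option B: 4 micro-batches of 2, accumulating scaled gradients.
micro_batch_size = 2
accum_steps = len(data) // micro_batch_size
accumulated = 0.0
for step in range(accum_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Dividing by accum_steps makes the sum equal the full-batch mean.
    accumulated += mse_grad(w, micro) / accum_steps

print(f"full-batch gradient:  {full_grad:.4f}")
print(f"accumulated gradient: {accumulated:.4f}")
```

In a PyTorch training loop the same idea is usually written as dividing the loss by the number of accumulation steps, calling backward on every micro-batch, and stepping the optimizer only once per group.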
                                                           


MEMORY KEEPS GROWING

 
 Symptoms: 
 • Memory usage increases over time 
 • Eventually OOM 
 
 Causes: 
 • Not clearing cache 
 • Storing too much history 
 • KV cache not managed 
 
 Debug steps: 
 1. torch.cuda.empty_cache() between batches (releases unused cached blocks) 
 2. del intermediate tensors that are no longer needed 
 3. For inference: limit context length or use a sliding window 
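
A sliding window keeps cache memory constant by discarding the oldest entries once a token limit is reached. The sketch below shows only the bookkeeping, using a bounded deque; `SlidingWindowCache` is a hypothetical class, and the string values stand in for the per-token key/value tensors a real inference engine would store.

```python
from collections import deque


class SlidingWindowCache:
    """Keeps cache entries for at most `max_tokens` most recent tokens."""

    def __init__(self, max_tokens: int):
        # A deque with maxlen drops the oldest item automatically on overflow.
        self.entries = deque(maxlen=max_tokens)

    def append(self, token_id: int, kv) -> None:
        self.entries.append((token_id, kv))

    def token_ids(self) -> list:
        return [token_id for token_id, _ in self.entries]

    def __len__(self) -> int:
        return len(self.entries)


cache = SlidingWindowCache(max_tokens=4)
for token_id in range(10):
    cache.append(token_id, kv=f"kv-{token_id}")  # placeholder for real K/V tensors

print(len(cache))         # 4: memory stays bounded
print(cache.token_ids())  # [6, 7, 8, 9]: only the most recent tokens remain
```

The tradeoff is that the model can no longer attend to tokens that have left the window, so the window size should cover the context the task actually needs.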
 


When This Matters

Situation                    What to know
Choosing GPU for inference   Parameters × bytes per parameter = VRAM for weights
Running 7B locally           INT4 quantization, ~8 GB needed
Training models              Need ~6x the FP16 model size in VRAM, plus activations
Fine-tuning on consumer GPU  Use LoRA + INT8/INT4
Getting OOM errors           Reduce batch size, use gradient accumulation
Understanding model cards    Check precision (FP16, INT4, etc.)
Cost optimization            Lower precision = cheaper inference
Understanding quantization   INT4 ~ 1-3% quality loss, 8x savings vs FP32

Production signal

Why this concept matters

Interview: 65% of ML infrastructure interviews
Production: Model deployment and cost
Performance: INT4 = 8x memory savings vs FP32