Classification and retrieval metrics: precision, recall, F1, perplexity, MRR, and NDCG
TL;DR
Accuracy can be misleading with imbalanced data. Use precision when false positives are costly, recall when false negatives are costly, and F1 when you need a balance of both. For retrieval, MRR measures how high the first relevant result ranks, while NDCG measures overall ranking quality.
Classification Metrics
Precision
Precision = TP / (TP + FP): of everything the model flagged as positive, what fraction actually was positive? High precision means few false positives.
Recall
Recall = TP / (TP + FN): of all actual positives, what fraction did the model find? High recall means few false negatives.
F1 Score
F1 = 2 × precision × recall / (precision + recall), the harmonic mean of precision and recall. It is only high when both are high, so a model cannot score well by sacrificing one for the other.
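To make the definitions concrete, here is a minimal sketch of how all three metrics fall out of confusion-matrix counts; the toy labels are made up purely for illustration.

```python
# Toy labels (made up for illustration): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)                          # of predicted positives, how many were right
recall = tp / (tp + fn)                             # of actual positives, how many we found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```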
Precision-Recall Tradeoff
Raising the decision threshold trades recall for precision; lowering it trades precision for recall. You rarely get to maximize both at once, which is why the choice of operating point matters.
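One way to see the tradeoff is to sweep the threshold and inspect the resulting precision/recall pairs. A sketch assuming scikit-learn is available; the labels and scores below are placeholders for a real model's predicted probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder labels and scores: substitute your model's predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9, 0.7, 0.3])

# Each threshold gives one (precision, recall) operating point.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, np.append(thresholds, np.nan)):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```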
Threshold Tuning
Most classifiers output a probability and call anything above 0.5 positive. That default is arbitrary; the right threshold depends on the relative cost of false positives and false negatives.
Common threshold strategies:
| Strategy | When to use |
|---|---|
| Maximize F1 | Balanced importance |
| Fixed recall (e.g., 95%) | Can’t miss positives (medical) |
| Fixed precision (e.g., 95%) | Can’t have false alarms (spam) |
| Cost-weighted | Know exact cost of FP vs FN |
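As a sketch of the "maximize F1" strategy above, assuming you already have predicted probabilities from some model (the data below is a placeholder); the same sweep works for fixed-recall or fixed-precision targets by changing the selection rule.

```python
import numpy as np

def best_f1_threshold(y_true, y_scores):
    """Sweep candidate thresholds and return the one that maximizes F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(y_scores):                    # candidate thresholds = observed scores
        y_pred = (y_scores >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        if tp == 0:                                  # no true positives -> F1 is 0, skip
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Placeholder data: replace with real labels and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9])
print(best_f1_threshold(y_true, y_scores))
```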
Class Imbalance
With 99% negatives, a classifier that always predicts "negative" scores 99% accuracy while catching nothing. This is why accuracy is the wrong lens for imbalanced problems.
Solutions:
- Evaluate with F1 or PR-AUC instead of accuracy.
- Reweight the loss by class frequency (e.g. scikit-learn's class_weight="balanced"), as sketched below.
- Oversample the minority class or undersample the majority class.
- Tune the decision threshold instead of accepting the 0.5 default.
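A sketch of the class-weighting fix, assuming scikit-learn; the synthetic 95/5 split is only for illustration. Printing accuracy next to F1 also shows why accuracy hides the problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Synthetic 95/5 imbalanced dataset (illustrative only).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency.
for cw in (None, "balanced"):
    model = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(cw,
          "accuracy:", round(accuracy_score(y_te, pred), 3),
          "f1:", round(f1_score(y_te, pred), 3))
```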
Retrieval Metrics
Precision@K and Recall@K
Precision@K is the fraction of the top K retrieved results that are relevant; Recall@K is the fraction of all relevant documents that appear in the top K. Recall@K is the usual health check for a RAG retriever: did the relevant context make it into the window at all?
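A minimal sketch of both metrics for a single query; the document ids and relevance judgments are made up for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@K and Recall@K for a single query.

    retrieved_ids: ranked list of document ids returned by the retriever.
    relevant_ids:  set of ids judged relevant for the query.
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision_at_k = hits / k
    recall_at_k = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision_at_k, recall_at_k

# Toy example: one of the two relevant docs shows up in the top 3.
print(precision_recall_at_k(["d3", "d7", "d1", "d9"], {"d1", "d2"}, k=3))  # (0.33..., 0.5)
```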
Mean Reciprocal Rank (MRR)
For each query, take the reciprocal of the rank of the first relevant result (1 for rank 1, 1/2 for rank 2, ...), then average over all queries. MRR rewards putting a correct answer at the very top, which fits tasks with a single right answer.
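A sketch of MRR over a small batch of queries; the ids and relevance sets are illustrative only.

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """MRR over a batch of queries.

    ranked_results: list of ranked id lists, one per query.
    relevant_sets:  list of sets of relevant ids, one per query.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_results, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank   # reciprocal rank of the first relevant hit
                break                 # only the first hit counts
    return total / len(ranked_results)

# Two toy queries: first relevant result at rank 1 and rank 3 -> MRR = (1 + 1/3) / 2
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z"]], [{"a"}, {"z"}]))  # 0.666...
```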
NDCG (Normalized Discounted Cumulative Gain)
DCG sums graded relevance with a logarithmic discount by rank, DCG@K = Σ rel_i / log2(i + 1); NDCG divides by the DCG of the ideal ordering so scores fall in [0, 1]. It rewards placing the most relevant documents highest, which makes it the standard choice when multiple results have varying degrees of relevance.
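A sketch of NDCG@K using the linear-gain formulation above (some libraries use 2^rel − 1 as the gain instead); the relevance grades are made up for illustration.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@K from graded relevance scores listed in ranked order.

    relevances: relevance grade of each retrieved item, in the order returned
                (e.g. 0 = irrelevant, 1 = somewhat relevant, 2 = highly relevant).
    """
    def dcg(rels):
        # Ranks start at 1, so the discount for position i (0-based) is log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

    dcg_k = dcg(relevances[:k])
    ideal = dcg(sorted(relevances, reverse=True)[:k])   # best possible ordering
    return dcg_k / ideal if ideal > 0 else 0.0

# Toy grades: a highly relevant doc buried at rank 3 lowers NDCG below 1.0.
print(ndcg_at_k([1, 0, 2, 0], k=4))
```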
When to Use What
| Scenario | Metric | Why |
|---|---|---|
| Binary classification, balanced | Accuracy | Simple, interpretable |
| Binary classification, imbalanced | F1 or PR-AUC | Accuracy misleading |
| Multi-class, balanced | Accuracy or Micro F1 | Simple aggregate |
| Multi-class, imbalanced | Macro F1 | Each class matters equally |
| Retrieval, one right answer | MRR | First result matters |
| Retrieval, multiple relevant | NDCG | Ranking quality |
| RAG retrieval component | Recall@K | Did we retrieve the context? |
| Medical/safety-critical | Recall | Can’t miss positives |
| Spam/fraud filtering | Precision | Can’t have false alarms |
Debugging with Metrics
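Per-class precision and recall usually localize a problem faster than a single aggregate number: one weak class can hide behind a good overall average. A sketch assuming scikit-learn's classification_report and confusion_matrix; the labels and predictions are placeholders.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder predictions: substitute your model's output.
y_true = ["cat", "cat", "dog", "dog", "dog", "bird", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird", "dog", "bird"]

# Per-class precision/recall/F1 exposes which class drags the average down.
print(classification_report(y_true, y_pred, zero_division=0))

# The raw confusion matrix shows *which* classes get confused with each other.
print(confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"]))
```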
When This Matters
| Situation | What to know |
|---|---|
| Evaluating any classifier | Check class balance first |
| Imbalanced data | Use F1 or PR-AUC, not accuracy |
| Multi-class imbalance | Use Macro F1 |
| Setting classification threshold | Tune based on cost of FP vs FN |
| Evaluating retrieval | MRR for single answer, NDCG for ranking |
| RAG system evaluation | Recall@K for retriever |
| Model seems good but users complain | Metric doesn’t match user goal |
Interview Notes
Interview relevance: metric selection comes up in roughly 75% of ML interviews.
Production impact: every model evaluation depends on it; the core skill is choosing the right metric for the task.