
ML Metrics

Classification and retrieval metrics: precision, recall, F1, perplexity, MRR, and NDCG

TL;DR

Accuracy can be misleading with imbalanced data. Use precision when false positives are costly, recall when false negatives are costly, and F1 when you need balance. For retrieval, MRR measures first-result quality, NDCG measures ranking quality.

Visual Overview

Confusion Matrix
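If you want to build one yourself, here is a minimal sketch assuming scikit-learn; the labels are made-up illustration data, not from this article.

```python
# Minimal sketch: building a confusion matrix with scikit-learn.
# y_true / y_pred are hypothetical binary labels for illustration only.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```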

Classification Metrics

Precision

Recall

F1 Score
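As a quick sketch of how the three relate, here is the standard arithmetic from confusion-matrix counts. The counts below are hypothetical; plug in your own TP/FP/FN.

```python
# Minimal sketch of precision, recall, and F1 for a binary classifier.
# tp/fp/fn are made-up counts for illustration.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)   # of everything predicted positive, how much was right?
recall = tp / (tp + fn)      # of everything actually positive, how much did we find?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```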

Precision-Recall Tradeoff

Threshold Tuning

The default classification threshold is 0.5, but that value is arbitrary: the right threshold depends on the relative cost of false positives and false negatives.

Choosing a Threshold

Common threshold strategies:

| Strategy | When to use |
| --- | --- |
| Maximize F1 | Balanced importance of FP and FN |
| Fixed recall (e.g., 95%) | Can't miss positives (medical) |
| Fixed precision (e.g., 95%) | Can't have false alarms (spam) |
| Cost-weighted | Know the exact cost of FP vs. FN |
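For example, the "Maximize F1" strategy from the table above can be sketched with scikit-learn's precision_recall_curve; the scores below are made up, and in practice you would use held-out predicted probabilities.

```python
# Minimal sketch: choosing the threshold that maximizes F1 instead of using 0.5.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])          # hypothetical labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55,  # hypothetical probabilities
                     0.3, 0.7, 0.6])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# F1 at every candidate threshold (the last precision/recall pair has no threshold).
f1s = 2 * precisions * recalls / (precisions + recalls + 1e-12)
best = np.argmax(f1s[:-1])
print(f"best threshold={thresholds[best]:.2f}, F1={f1s[best]:.2f}")
```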

Class Imbalance


Solutions for Class Imbalance
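Two common fixes are re-weighting the loss so mistakes on the rare class cost more, and resampling the training data. A minimal sketch, assuming scikit-learn and a synthetic dataset (both hypothetical, not from this article):

```python
# Minimal sketch of handling class imbalance with class weights.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, heavily imbalanced dataset: ~95% negatives, ~5% positives.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Option 1: re-weight the loss so the rare class counts more.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2 (not shown): resample the training data, e.g. oversample the
# minority class or undersample the majority class before fitting.
```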

Retrieval Metrics

Precision@K and Recall@K
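Both can be computed per query from the ranked results. A minimal sketch with hypothetical document IDs:

```python
# Minimal sketch of Precision@K and Recall@K for a single query.
def precision_recall_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    precision_at_k = hits / k                              # how much of the top K is relevant
    recall_at_k = hits / len(relevant) if relevant else 0  # how much of the relevant set we found
    return precision_at_k, recall_at_k

# 1 of the top 3 results is relevant -> precision 1/3; half of the relevant docs found -> recall 0.5
print(precision_recall_at_k(["d3", "d7", "d1", "d9"], ["d1", "d2"], k=3))
```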

Mean Reciprocal Rank (MRR)
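MRR averages 1/rank of the first relevant result across queries. A minimal sketch, with made-up ranked lists:

```python
# Minimal sketch of Mean Reciprocal Rank over several queries.
def mrr(ranked_lists, relevant_answers):
    total = 0.0
    for ranked, answer in zip(ranked_lists, relevant_answers):
        for rank, doc in enumerate(ranked, start=1):
            if doc == answer:
                total += 1.0 / rank
                break                 # only the first relevant hit counts
    return total / len(ranked_lists)

# Query 1 finds its answer at rank 1, query 2 at rank 3 -> (1 + 1/3) / 2 ≈ 0.67
print(mrr([["a", "b", "c"], ["x", "y", "a"]], ["a", "a"]))
```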

NDCG (Normalized Discounted Cumulative Gain)
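NDCG discounts gains by log of rank and normalizes by the best possible ordering. A minimal sketch using the simple (non-exponentiated) gain variant and hypothetical relevance grades:

```python
# Minimal sketch of NDCG@K from graded relevance scores.
import math

def dcg(relevances, k):
    # Gain at each rank, discounted by log2(rank + 1).
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    ideal = sorted(relevances, reverse=True)   # best possible ordering of the same items
    ideal_dcg = dcg(ideal, k)
    return dcg(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the results in the order the system returned them.
print(ndcg([3, 1, 0, 2], k=4))
```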

When to Use What

| Scenario | Metric | Why |
| --- | --- | --- |
| Binary classification, balanced | Accuracy | Simple, interpretable |
| Binary classification, imbalanced | F1 or PR-AUC | Accuracy is misleading |
| Multi-class, balanced | Accuracy or Micro F1 | Simple aggregate |
| Multi-class, imbalanced | Macro F1 | Each class matters equally |
| Retrieval, one right answer | MRR | First result matters |
| Retrieval, multiple relevant | NDCG | Ranking quality |
| RAG retrieval component | Recall@K | Did we retrieve the context? |
| Medical/safety-critical | Recall | Can't miss positives |
| Spam/fraud filtering | Precision | Can't have false alarms |
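The micro vs. macro F1 distinction from the table is easy to see in code. A minimal sketch with made-up multi-class labels, assuming scikit-learn:

```python
# Minimal sketch: micro vs. macro F1 on a hypothetical multi-class problem.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 0, 2, 2, 1, 2]

print(f1_score(y_true, y_pred, average="micro"))  # aggregates over all samples
print(f1_score(y_true, y_pred, average="macro"))  # averages per-class F1, so each class counts equally
```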

Debugging with Metrics

When This Matters

| Situation | What to know |
| --- | --- |
| Evaluating any classifier | Check class balance first |
| Imbalanced data | Use F1 or PR-AUC, not accuracy |
| Multi-class imbalance | Use Macro F1 |
| Setting classification threshold | Tune based on cost of FP vs. FN |
| Evaluating retrieval | MRR for single answer, NDCG for ranking |
| RAG system evaluation | Recall@K for the retriever |
| Model seems good but users complain | Metric doesn't match user goal |
Interview Notes
- 💼 Interview relevance: ~75% of ML interviews
- 🏭 Production impact: every model evaluation
- Performance: choosing the right metric for the task