I/D/E · Generative AI

ML Metrics

Summary

Classification and retrieval metrics: precision, recall, F1, perplexity, MRR, and NDCG

TL;DR

Accuracy can be misleading with imbalanced data. Use precision when false positives are costly, recall when false negatives are costly, and F1 when you need a balance of the two. For retrieval, MRR measures first-result quality and NDCG measures ranking quality.

Visual Overview

Confusion Matrix

                                                           
                     Predicted                             
                  Pos      Neg                             
                                        
   Actual  Pos    TP      FN                            
                                        
           Neg    FP      TN                            
                                        
                                                           
    TP = True Positive  (correct positive)                 
    FP = False Positive (false alarm)                      
    FN = False Negative (missed)                           
    TN = True Negative  (correct negative)                 
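
The four cells can be counted directly from paired labels. The following is a minimal sketch, assuming binary labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample data are illustrative, not from the original.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = positive, 0 = negative."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # correct positive
        elif predicted == 1 and actual == 0:
            fp += 1  # false alarm
        elif predicted == 0 and actual == 1:
            fn += 1  # missed positive
        else:
            tn += 1  # correct negative
    return tp, fp, fn, tn


# Hypothetical labels, chosen only for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```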
                                                           


Classification Metrics

Precision


                                                           
   Precision = TP / (TP + FP)                              
                                                           
   "Of everything I predicted positive, how many were      
    actually positive?"                                    
                                                           
   High precision = few false alarms                       
   Optimize when: false positives are costly               
   (spam filter, fraud detection)                          
                                                           

Recall


                                                           
   Recall = TP / (TP + FN)                                 
                                                           
   "Of everything that was actually positive, how many     
    did I find?"                                           
                                                           
   High recall = few misses                                
   Optimize when: false negatives are costly               
   (disease detection, security threats)                   
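
Both formulas translate directly into code once the confusion-matrix counts are known. This is a small sketch with made-up counts; returning 0.0 when the denominator is zero is a convention assumed here, not something the formulas dictate.

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns 0.0 if nothing was predicted positive (assumed convention)."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """TP / (TP + FN). Returns 0.0 if there are no actual positives (assumed convention)."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts: 30 true positives, 10 false alarms, 20 misses.
tp, fp, fn = 30, 10, 20
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```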
                                                           

F1 Score


                                                           
   F1 = 2 x (Precision x Recall) / (Precision + Recall)    
                                                           
   Harmonic mean of precision and recall.                  
   Use when: you need a single number and classes are imbalanced

   Note: the harmonic mean punishes a large gap between
   precision and recall harder than the arithmetic mean does.
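
To see how the harmonic mean behaves compared with a plain average, here is a short sketch. The precision and recall values are invented for illustration, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero (assumed convention)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Balanced case: harmonic and arithmetic means agree.
balanced_f1 = f1_score(0.5, 0.5)

# Lopsided case: high precision, very low recall.
lopsided_f1 = f1_score(0.9, 0.1)
lopsided_arithmetic = (0.9 + 0.1) / 2

print(f"balanced F1:         {balanced_f1:.2f}")          # 0.50
print(f"lopsided F1:         {lopsided_f1:.2f}")          # 0.18
print(f"lopsided arithmetic: {lopsided_arithmetic:.2f}")  # 0.50
```

The arithmetic mean hides the poor recall, while F1 drops to about 0.18.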
                                                           

Precision-Recall Tradeoff


                                                           
        High Threshold              Low Threshold          
        (conservative)              (aggressive)           
                                                         
                                                         
                           
  P        HIGH                      LOW               
                           
                           
  R        LOW                       HIGH              
                           
                                                           
   Moving threshold trades one for the other.              
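
The tradeoff can be observed by sweeping the threshold over a fixed set of scores. This is an illustrative sketch: the scores and labels are invented, and `precision_recall_at` is a hypothetical helper name.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict positive when score >= threshold, then return (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and ground-truth labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.80 precision=1.00 recall=0.50
# threshold=0.50 precision=0.75 recall=0.75
# threshold=0.25 precision=0.67 recall=1.00
```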
                                                           


Threshold Tuning

The default threshold is 0.5, which is an arbitrary choice rather than one tuned to your costs.

Choosing a Threshold

                                                           
   The right threshold depends on your cost function:      
                                                           
     FN 10x worse than FP?  ->  Lower threshold
     FP 10x worse than FN?  ->  Higher threshold
     Equal cost?            ->  Optimize F1
                                                           

Common threshold strategies:

Strategy                      When to use
Maximize F1                   Balanced importance
Fixed recall (e.g., 95%)      Can’t miss positives (medical)
Fixed precision (e.g., 95%)   Can’t have false alarms (spam)
Cost-weighted                 Know exact cost of FP vs FN
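
As one way to make the cost-weighted strategy concrete, the sketch below picks the candidate threshold with the lowest total cost on a validation set. The data, the candidate thresholds, and the helper names `total_cost` and `best_threshold` are all assumptions made for illustration.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of errors when predicting positive for score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn


def best_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn),
    )


# Hypothetical validation scores and labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
candidates = [0.25, 0.5, 0.8]

# Misses are 10x worse than false alarms: the low threshold wins.
fn_heavy = best_threshold(scores, labels, candidates, cost_fp=1, cost_fn=10)
# False alarms are 10x worse than misses: the high threshold wins.
fp_heavy = best_threshold(scores, labels, candidates, cost_fp=10, cost_fn=1)

print(fn_heavy, fp_heavy)  # 0.25 0.8
```

In practice the candidate grid would be finer and the costs would come from the application; the selection logic stays the same.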

Class Imbalance

The Class Imbalance Problem

                                                           
   Dataset: 99% negative, 1% positive (fraud detection)    
                                                           
   Model predicts "negative" for everything:               
     Accuracy = 99%  ->  Looks great!
     Recall = 0%     ->  Completely useless
                                                           
   Accuracy is misleading with imbalanced classes.         
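
The scenario is easy to reproduce. The sketch below builds a synthetic 1,000-example dataset with 1% positives (an assumed size, chosen to match the 99/1 split) and scores an "always negative" model.

```python
# Synthetic dataset: 10 positives (1%), 990 negatives (99%).
y_true = [1] * 10 + [0] * 990
# A degenerate model that predicts "negative" for everything.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2%} recall={recall:.2%}")  # accuracy=99.00% recall=0.00%
```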
                                                           

Solutions:

1. USE DIFFERENT METRICS

                                                           
   Bad:  Accuracy                                          
   Good: F1, Precision, Recall, PR-AUC                     
                                                           
   F1 on "always negative" model = 0 (reveals the problem) 
                                                           


2. PR-AUC OVER ROC-AUC
 
  
  ROC-AUC can look good on imbalanced data because the many
  true negatives keep the false positive rate low (high TNR)
  PR-AUC focuses on positive-class performance
  
  Severe imbalance? PR-AUC is the metric. 
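
The gap between the two metrics can be shown on a small synthetic ranking. This sketch uses pure Python: ROC-AUC is computed as the fraction of positive/negative pairs ranked correctly, and average precision is used as a common step-wise stand-in for PR-AUC. The data (2 positives among 100 items, ranked 5th and 10th) is invented for illustration.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of precision values taken at the rank of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# 100 items with distinct, descending scores; positives sit at ranks 5 and 10.
ranks = range(1, 101)
scores = [(101 - r) / 100 for r in ranks]
labels = [1 if r in (5, 10) else 0 for r in ranks]

auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
print(f"ROC-AUC={auc:.2f} average precision={ap:.2f}")  # ROC-AUC=0.94 average precision=0.20
```

The ranker looks strong by ROC-AUC, yet four of the top five results are negatives, which average precision exposes.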
  
 

3. MACRO VS MICRO AVERAGING
 
  
  Multi-class with imbalance: 
  
  Micro F1: Aggregate TP, FP, FN across all classes
    -> Dominated by majority class

  Macro F1: Compute F1 per class, then average
    -> Each class weighted equally
  
  Imbalanced? Use Macro F1 to ensure minority classes 
  matter. 
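
The difference between the two averages shows up clearly on a two-class example where the minority class is mostly missed. This is a sketch with invented labels; it uses the identity F1 = 2TP / (2TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall.

```python
def per_class_counts(y_true, y_pred, classes):
    """Count TP, FP, FN for each class in a single-label multi-class setting."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            counts[actual]["tp"] += 1
        else:
            counts[predicted]["fp"] += 1  # false alarm for the predicted class
            counts[actual]["fn"] += 1     # miss for the actual class
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical data: 90 examples of class "a", 10 of class "b".
# The model gets every "a" right but only 2 of the 10 "b" examples.
y_true = ["a"] * 90 + ["b"] * 10
y_pred = ["a"] * 90 + ["a"] * 8 + ["b"] * 2

counts = per_class_counts(y_true, y_pred, classes=["a", "b"])

# Macro: average the per-class F1 scores.
macro_f1 = sum(f1_from_counts(**c) for c in counts.values()) / len(counts)

# Micro: pool the counts across classes, then compute one F1.
pooled = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
micro_f1 = f1_from_counts(**pooled)

print(f"micro F1={micro_f1:.2f} macro F1={macro_f1:.2f}")  # micro F1=0.92 macro F1=0.65
```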
  
 
 

Retrieval Metrics

Precision@K and Recall@K


                                                           
   P@K = (relevant docs in top K) / K                      
   "Of the K results I returned, how many were relevant?"  
                                                           
   R@K = (relevant docs in top K) / (total relevant docs)  
   "Of all relevant docs, how many appear in my top K?"    
                                                           
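
Both quantities can be computed from a ranked list of document IDs and a set of relevant IDs. The sketch below uses hypothetical document IDs; returning 0.0 for recall when there are no relevant documents is an assumed convention.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-K results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top K."""
    if not relevant_ids:
        return 0.0  # assumed convention when nothing is relevant
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical retrieval result and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8"}

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
print(f"P@3={p3:.2f} R@3={r3:.2f} P@5={p5:.2f} R@5={r5:.2f}")
# P@3=0.33 R@3=0.33 P@5=0.40 R@5=0.67
```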

Mean Reciprocal Rank (MRR)


                                                           
   MRR = (1/|Q|) x SUM(1 / rank_i)                         
                                                           
   Where |Q| = number of queries and rank_i = position of
   the first relevant result for query i
                                                           
   Example:                                                
     Query 1: first relevant at position 3  ->  1/3
     Query 2: first relevant at position 1  ->  1/1
     Query 3: first relevant at position 2  ->  1/2
                                                           
     MRR = (1/3 + 1 + 1/2) / 3 ≈ 0.61
                                                           
   Use when: you care most about the first relevant result 
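
The worked example can be reproduced in a few lines. In this sketch each query is represented as a list of booleans marking whether the result at that position is relevant; that representation, and scoring a query with no relevant result as 0, are assumptions made for illustration.

```python
def mean_reciprocal_rank(results_per_query):
    """Average of 1/rank of the first relevant result across queries.

    A query with no relevant result contributes 0 (assumed convention).
    """
    total = 0.0
    for relevance_flags in results_per_query:
        for position, is_relevant in enumerate(relevance_flags, start=1):
            if is_relevant:
                total += 1 / position
                break  # only the first relevant result counts
    return total / len(results_per_query)


queries = [
    [False, False, True],  # first relevant at position 3
    [True, False, False],  # first relevant at position 1
    [False, True, True],   # first relevant at position 2
]

mrr = mean_reciprocal_rank(queries)
print(f"MRR={mrr:.2f}")  # MRR=0.61
```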
                                                           

NDCG (Normalized Discounted Cumulative Gain)


                                                           
   DCG@K = SUM over i = 1..K of (relevance_i / log2(i + 1))
                                                           
   NDCG@K = DCG@K / IDCG@K  (normalized by ideal ranking)  
                                                           
   Accounts for:                                           
     • Graded relevance (not just binary)
     • Position discount (top results matter more)
                                                           
   Use when: relevance has degrees, ranking order matters  
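
A direct translation of the two formulas is sketched below. One simplifying assumption: the ideal ranking is built by re-sorting the same list of relevance grades, whereas a full evaluation would take the ideal from all judged documents for the query. The grades are invented for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Sum of relevance_i / log2(i + 1) over the top K positions (i starts at 1)."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@K divided by the DCG@K of the ideal ordering; 0.0 if the ideal is zero."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0


# Hypothetical graded relevance of the returned results, in ranked order.
relevances = [3, 2, 3, 0, 1]

ndcg = ndcg_at_k(relevances, k=5)
perfect = ndcg_at_k([3, 3, 2, 1, 0], k=5)
print(f"NDCG@5={ndcg:.3f} perfect={perfect:.3f}")  # NDCG@5=0.972 perfect=1.000
```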
                                                           


When to Use What

Scenario                             Metric                 Why
Binary classification, balanced      Accuracy               Simple, interpretable
Binary classification, imbalanced    F1 or PR-AUC           Accuracy misleading
Multi-class, balanced                Accuracy or Micro F1   Simple aggregate
Multi-class, imbalanced              Macro F1               Each class matters equally
Retrieval, one right answer          MRR                    First result matters
Retrieval, multiple relevant         NDCG                   Ranking quality
RAG retrieval component              Recall@K               Did we retrieve the context?
Medical/safety-critical              Recall                 Can’t miss positives
Spam/fraud filtering                 Precision              Can’t have false alarms

Debugging with Metrics

HIGH ACCURACY BUT POOR REAL-WORLD PERFORMANCE

                                                           
   Symptoms:                                               
     • Model accuracy is 95%                               
     • Users complain it doesn't work
                                                           
   Causes:                                                 
     • Class imbalance (predicting majority class)         
     • Test set doesn't match production distribution      
     • Wrong metric for actual goal                        
                                                           
   Debug steps:                                            
     1. Check class balance in test set                    
     2. Compute per-class metrics (confusion matrix)       
     3. Compute F1, not just accuracy                      
     4. Sample production data and evaluate                
                                                           


PRECISION AND RECALL BOTH LOW

 
 Symptoms: 
 • P = 0.4, R = 0.3 (both bad) 
 • Model seems random
 
 Causes: 
 • Model not learning (training issue) 
 • Features not predictive 
 • Data quality issues
 
 Debug steps: 
 1. Check training loss - is it decreasing? 
 2. Overfit to small dataset first 
 3. Check feature importance 
 4. Examine misclassified examples manually 
 


When This Matters

Situation                             What to know
Evaluating any classifier             Check class balance first
Imbalanced data                       Use F1 or PR-AUC, not accuracy
Multi-class imbalance                 Use Macro F1
Setting classification threshold      Tune based on cost of FP vs FN
Evaluating retrieval                  MRR for single answer, NDCG for ranking
RAG system evaluation                 Recall@K for retriever
Model seems good but users complain   Metric doesn’t match user goal

Production signal

Why this concept matters

Interview     75% of ML interviews
Production    Every model evaluation
Performance   Choosing right metric for task