I/D/E · Essay

TIL - Battling Database Duplicates - A Perspective on Data Integrity

Summary

TIL POST: A Perspective on Data Integrity

Today I Learned: Neglecting a unique index can lead to a cascade of data integrity issues, but SQL provides powerful tools to rectify such oversights.

The Duplicate Problem Flow

  THE DUPLICATE PROBLEM CASCADE                         
                                                        
     
    SCHEMA DESIGN (missing unique index)              
    CREATE TABLE stock_feeds (                        
      stock_reference_id INT,                         
      feed_reference_id INT                           
      -- No UNIQUE constraint!                        
    );                                                
     
                                                       
     
    PRODUCTION: Duplicate INSERTs happen              
                                                      
    (stock_1, feed_1)                                
    (stock_1, feed_1)   Should have been blocked    
    (stock_1, feed_1)   More duplicates...          
     
                                                       
     
    CASCADE EFFECTS:                                  
    • Wrong counts in reports                         
    • JOIN results multiplied                         
    • Aggregations incorrect                          
    • Adding index NOW fails (duplicates exist)       
     
                                                       
     
    SOLUTION: Window functions + cleanup              
                                                      
    1. Identify duplicates with ROW_NUMBER()          
    2. Delete all but first occurrence                
    3. Add unique constraint                          
     

Context

In our stock_feeds table, we omitted a unique index on (stock_reference_id, feed_reference_id). This oversight led to duplicate entries, causing a ripple effect across our data ecosystem.

The Challenge

Upon attempting to add the missing unique index post-facto, we encountered errors due to existing duplicates. This scenario underscores the critical nature of proactive schema design and the potential ramifications of overlooking seemingly minor details.

Solution Deep Dive

We employed window functions to identify and eliminate duplicates. Here’s the process:

Window Function: ROW_NUMBER() Explained

  ROW_NUMBER() FOR DUPLICATE DETECTION                    
                                                          
  Original Data                                           
             
   id  stock_reference_id  feed_ref_id                
             
    1  100                 200                        
    5  100                 200               dup     
    9  100                 200               dup     
    2  101                 201                        
    7  101                 201               dup     
             
                                                          
  After ROW_NUMBER() PARTITION BY (stock_reference_id,    
                                   feed_reference_id):    
        
   id  stock_reference_id  feed_ref_id      rn       
        
    1  100                 200               1      
    5  100                 200               2      
    9  100                 200               3      
    2  101                 201               1      
    7  101                 201               2      
        
                                                          
  Keep rows where rn = 1  Removes all duplicates         

Duplicate Identification

SELECT *
FROM (
    SELECT *,
    ROW_NUMBER() OVER(PARTITION BY stock_reference_id, feed_reference_id ORDER BY id ASC) AS rn
    FROM stock_feeds
) t
WHERE t.rn > 1;

Duplicate Removal

WITH duplicates AS (
    SELECT id,
    ROW_NUMBER() OVER(PARTITION BY stock_reference_id, feed_reference_id ORDER BY id ASC) AS rn
    FROM stock_feeds
)
DELETE FROM stock_feeds
WHERE id NOT IN (
    SELECT id FROM duplicates WHERE rn = 1
);

Key Insights

  • Window functions like ROW_NUMBER() are invaluable for handling duplicate data scenarios.
  • CTEs (Common Table Expressions) enhance query readability and maintainability.
  • The ORDER BY clause within the OVER() function is crucial for deterministic duplicate handling.

Alternative Approaches

While window functions worked well in our case, it’s crucial to be aware of other techniques:

Duplicate Detection Approaches Comparison

  APPROACH COMPARISON FOR DUPLICATES                    
                                                        
     
    WINDOW FUNCTION (ROW_NUMBER)                      
    Best for: Cleanup, keeping specific record        
    Pros: Can choose which duplicate to keep          
    Cons: Memory-intensive on large tables            
     
                                                        
     
    GROUP BY + HAVING                                 
    Best for: Counting duplicates, simple detect      
    Pros: Simple, fast with proper indexing           
    Cons: Doesn't identify which to delete            
     
                                                        
     
    SELF-JOIN                                         
    Best for: Complex duplicate conditions            
    Pros: Flexible, can leverage indexes              
    Cons: Can be slow on very large tables            
     
                                                        
     
    EXISTS SUBQUERY                                   
    Best for: Existence checks                        
    Pros: Efficient when optimized properly           
    Cons: Query planner dependent                     
     

GROUP BY with HAVING

SELECT stock_reference_id, feed_reference_id, COUNT(*)
FROM stock_feeds
GROUP BY stock_reference_id, feed_reference_id
HAVING COUNT(*) > 1;

Self-Join

SELECT DISTINCT sf1.*
FROM stock_feeds sf1
INNER JOIN stock_feeds sf2 ON
    sf1.stock_reference_id = sf2.stock_reference_id AND
    sf1.feed_reference_id = sf2.feed_reference_id AND
    sf1.id > sf2.id;

EXISTS Subquery

SELECT sf1.*
FROM stock_feeds sf1
WHERE EXISTS (
    SELECT 1
    FROM stock_feeds sf2
    WHERE sf1.stock_reference_id = sf2.stock_reference_id AND
          sf1.feed_reference_id = sf2.feed_reference_id AND
          sf1.id > sf2.id
);

Performance Considerations

  • Window functions can be memory-intensive for large datasets.
  • GROUP BY might perform better on smaller tables or with appropriate indexing.
  • Self-joins can leverage index seeks but may struggle with very large tables.
  • EXISTS can be efficient when properly optimized by the query planner.