TIL - Battling Database Duplicates - A Perspective on Data Integrity
Published: | at 03:42 AM
Today I Learned: Neglecting a unique index can lead to a cascade of data integrity issues, but SQL provides powerful tools to rectify such oversights.
The Duplicate Problem Flow
The Duplicate Problem Flow
┌────────────────────────────────────────────────────────┐│ THE DUPLICATE PROBLEM CASCADE ││││┌─────────────────────────────────────────────────┐│││ SCHEMA DESIGN (missing unique index) ││││ CREATE TABLE stock_feeds ( ││││ stock_reference_id INT, ││││ feed_reference_id INT ││││ -- No UNIQUE constraint! ││││ ); │││└───────────────────────┬─────────────────────────┘││▼││┌─────────────────────────────────────────────────┐│││ PRODUCTION: Duplicate INSERTs happen ││││││││ (stock_1, feed_1) ✓││││ (stock_1, feed_1) ✓← Should have been blocked ││││ (stock_1, feed_1) ✓← More duplicates... │││└───────────────────────┬─────────────────────────┘││▼││┌─────────────────────────────────────────────────┐│││ CASCADE EFFECTS: ││││ • Wrong counts in reports ││││ • JOIN results multiplied ││││ • Aggregations incorrect ││││ • Adding index NOW fails (duplicates exist) │││└───────────────────────┬─────────────────────────┘││▼││┌─────────────────────────────────────────────────┐│││ SOLUTION: Window functions + cleanup ││││││││ 1. Identify duplicates with ROW_NUMBER() ││││ 2. Delete all but first occurrence ││││ 3. Add unique constraint │││└─────────────────────────────────────────────────┘│└────────────────────────────────────────────────────────┘
Context
In our stock_feeds table, we omitted a unique index on (stock_reference_id, feed_reference_id). This oversight led to duplicate entries, causing a ripple effect across our data ecosystem.
The Challenge
Upon attempting to add the missing unique index post-facto, we encountered errors due to existing duplicates. This scenario underscores the critical nature of proactive schema design and the potential ramifications of overlooking seemingly minor details.
Solution Deep Dive
We employed window functions to identify and eliminate duplicates. Here’s the process:
SELECT *FROM ( SELECT *, ROW_NUMBER() OVER(PARTITION BY stock_reference_id, feed_reference_id ORDER BY id ASC) AS rn FROM stock_feeds) tWHERE t.rn > 1;
Duplicate Removal
WITH duplicates AS ( SELECT id, ROW_NUMBER() OVER(PARTITION BY stock_reference_id, feed_reference_id ORDER BY id ASC) AS rn FROM stock_feeds)DELETE FROM stock_feedsWHERE id NOT IN ( SELECT id FROM duplicates WHERE rn = 1);
Key Insights
Window functions like ROW_NUMBER() are invaluable for handling duplicate data scenarios.
CTEs (Common Table Expressions) enhance query readability and maintainability.
The ORDER BY clause within the OVER() function is crucial for deterministic duplicate handling.
Alternative Approaches
While window functions worked well in our case, it’s crucial to be aware of other techniques:
Duplicate Detection Approaches Comparison
Duplicate Detection Approaches Comparison
┌────────────────────────────────────────────────────────┐│ APPROACH COMPARISON FOR DUPLICATES ││││┌─────────────────────────────────────────────────┐│││ WINDOW FUNCTION (ROW_NUMBER) ││││ Best for: Cleanup, keeping specific record ││││ Pros: Can choose which duplicate to keep ││││ Cons: Memory-intensive on large tables │││└─────────────────────────────────────────────────┘││││┌─────────────────────────────────────────────────┐│││GROUP BY + HAVING ││││ Best for: Counting duplicates, simple detect ││││ Pros: Simple, fast with proper indexing ││││ Cons: Doesn't identify which to delete │││└─────────────────────────────────────────────────┘││││┌─────────────────────────────────────────────────┐│││ SELF-JOIN ││││ Best for: Complex duplicate conditions ││││ Pros: Flexible, can leverage indexes ││││ Cons: Can be slow on very large tables │││└─────────────────────────────────────────────────┘││││┌─────────────────────────────────────────────────┐│││EXISTS SUBQUERY ││││ Best for: Existence checks ││││ Pros: Efficient when optimized properly ││││ Cons: Query planner dependent │││└─────────────────────────────────────────────────┘│└────────────────────────────────────────────────────────┘