
Deduplicate Records With Confidence Rules

Identifies and collapses duplicate or near-duplicate rows using deterministic and fuzzy rules in SQL.

Role-Based, Step-by-Step, Structured-Output

Prompt

ROLE: You are a data engineer who deduplicates records while preserving the right surviving row.

CONTEXT: Table [TABLE_NAME] (schema [SCHEMA]) contains duplicates. A true duplicate is defined by [DUP_KEY] (exact match) and/or near-match on [FUZZY_FIELDS] (e.g., normalized name + email). The surviving record should be the [SURVIVOR_RULE] (e.g., most recently updated, most complete). Engine: [DATABASE_ENGINE].

TASK:
1. Separate exact duplicates from near-duplicates and state the matching rule for each.
2. For exact duplicates, write SQL using ROW_NUMBER() partitioned by [DUP_KEY] and ordered by the survivor rule, keeping only rows where rn = 1.
3. For near-duplicates, normalize fields (lowercase, trim, strip punctuation) and group on the normalized key; note where similarity functions (e.g., LEVENSHTEIN, SOUNDEX, Jaccard) are needed and whether [DATABASE_ENGINE] supports them.
4. Produce both a 'rows to keep' query and a 'rows flagged as duplicates' query for review before deletion.
5. Recommend a safe delete/merge procedure (audit table first).

OUTPUT FORMAT: Matching rules -> Exact-dedup ```sql``` -> Fuzzy-dedup ```sql``` -> Keep vs flag queries -> Safe deletion procedure.

CONSTRAINTS: Never hard-delete before producing a reviewable flagged set. Make the survivor rule deterministic (add a tiebreaker so rn=1 is unique). Normalize before comparing. State the false-merge risk of fuzzy matching.
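
To give a sense of the kind of output this prompt is aiming for, here is a minimal sketch of the keep/flag pattern. It is illustrative only and not part of the prompt. It assumes SQLite (3.25 or later, for window functions) driven from Python's sqlite3 module, a hypothetical customers table, email as the [DUP_KEY], "most recently updated" as the [SURVIVOR_RULE], and the unique id as the tiebreaker. Other engines will need different syntax for the audit copy and the normalization functions.

```python
import sqlite3

# Hypothetical table and data standing in for [TABLE_NAME]; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT, updated_at TEXT);
INSERT INTO customers VALUES
  (1, 'ann@example.com', 'Ann Lee',  '2024-01-01'),
  (2, 'ann@example.com', 'Ann Lee',  '2024-03-01'),
  (3, 'ann@example.com', 'Ann  Lee', '2024-03-01'),
  (4, 'bob@example.com', 'Bob Ray',  '2024-02-01');
""")

# Rank rows inside each duplicate group. The id tiebreaker makes rn = 1 unique
# even when two rows share the same updated_at.
RANKED = """
WITH ranked AS (
  SELECT id,
         ROW_NUMBER() OVER (
           PARTITION BY email                 -- assumed [DUP_KEY]
           ORDER BY updated_at DESC, id DESC  -- survivor rule, then tiebreaker
         ) AS rn
  FROM customers
)
SELECT id FROM ranked WHERE rn {op} 1 ORDER BY id
"""

keep_ids = [row[0] for row in conn.execute(RANKED.format(op="="))]     # rows to keep
flagged_ids = [row[0] for row in conn.execute(RANKED.format(op=">"))]  # rows to review

# Near-duplicate candidates: group on a normalized key (lowercase, trimmed,
# spaces removed). This only surfaces candidates; it does not merge anything.
fuzzy_groups = conn.execute("""
SELECT lower(replace(trim(name), ' ', '')) AS name_key, COUNT(*) AS n
FROM customers
GROUP BY name_key
HAVING COUNT(*) > 1
""").fetchall()

# Safe deletion: create an empty audit table with the same columns, then copy
# the flagged rows and delete them inside a single transaction.
conn.execute("CREATE TABLE customers_dedup_audit AS SELECT * FROM customers WHERE 0")
placeholders = ",".join("?" * len(flagged_ids))
with conn:  # commits on success, rolls back if either statement raises
    conn.execute(
        f"INSERT INTO customers_dedup_audit SELECT * FROM customers WHERE id IN ({placeholders})",
        flagged_ids,
    )
    conn.execute(f"DELETE FROM customers WHERE id IN ({placeholders})", flagged_ids)
```

In this sketch the flagged set is computed and inspectable before anything is deleted, and the audit table keeps a copy of every removed row so a false merge can be reversed.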

Recommended models

claude, gpt-4o, gemini
