Deduplication

What it does

Removes duplicate rows from a CSV file. By default, rows are compared on all columns; you can restrict the comparison to a subset of columns, or exclude specific columns from it.

Python usage

from csvsmith.tools.row_dedup import dedupe_with_report

rows = [
    {"id": "1", "name": "Alice"},
    {"id": "2", "name": "Bob"},
    {"id": "1", "name": "Alice"},
]

# Treat rows as duplicates when both "id" and "name" match
deduped, report = dedupe_with_report(rows, subset=["id", "name"])
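
The rows in the example are plain dicts with string values, which is the shape the standard library's csv.DictReader produces. As a minimal sketch of getting such rows from CSV text (the helper name rows_from_csv is hypothetical and not part of csvsmith):

```python
import csv
import io


def rows_from_csv(fileobj):
    """Read CSV data from an open text file object into a list of dicts.

    Hypothetical helper for illustration; it is not a csvsmith function.
    Every value comes back as a string, keyed by the header row.
    """
    return list(csv.DictReader(fileobj))


# In-memory CSV text stands in for a real file here.
sample = io.StringIO("id,name\n1,Alice\n2,Bob\n1,Alice\n")
rows = rows_from_csv(sample)
print(rows[0])
```

For a file on disk, open it with newline="" as the csv module documentation recommends, then pass the file object to the helper.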

CLI usage

csvsmith dedupe input.csv -o output.csv --subset id,name --keep first --report report.json

Behavior notes

  • Subset: Columns to check for duplicates, given as a comma-separated list on the command line (a list of column names in Python). If omitted, all columns are used.

  • Keep: Which occurrence of a duplicated row to keep, either first (default) or last.

  • Report: Path to a JSON file where a summary of duplicates found will be saved.
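
To make the subset and keep semantics concrete, here is a standalone sketch of how such a deduplication could work. It is not csvsmith's implementation: the function name dedupe_rows and the report keys (total_rows, kept_rows, removed_rows) are assumptions made for illustration only, and the real report format may differ.

```python
import json


def dedupe_rows(rows, subset=None, keep="first"):
    """Illustrative dedup over a list of dicts; NOT csvsmith's actual code.

    subset: column names to compare; None means compare all columns.
    keep:   "first" or "last" occurrence of each duplicate group.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    # For keep="last", walk the rows backwards so the last
    # occurrence of each key is the one encountered first.
    indexed = list(enumerate(rows))
    if keep == "last":
        indexed.reverse()

    seen = set()
    kept_indices = []
    for i, row in indexed:
        cols = subset if subset is not None else sorted(row)
        key = tuple((c, row.get(c)) for c in cols)
        if key not in seen:
            seen.add(key)
            kept_indices.append(i)

    kept_indices.sort()  # restore the original row order
    deduped = [rows[i] for i in kept_indices]

    # Hypothetical report layout, chosen only for this sketch.
    report = {
        "total_rows": len(rows),
        "kept_rows": len(deduped),
        "removed_rows": len(rows) - len(deduped),
    }
    return deduped, report


rows = [
    {"id": "1", "name": "Alice"},
    {"id": "2", "name": "Bob"},
    {"id": "1", "name": "Alicia"},
]

# Comparing on "id" only, rows 1 and 3 count as duplicates.
first, report = dedupe_rows(rows, subset=["id"], keep="first")
last, _ = dedupe_rows(rows, subset=["id"], keep="last")
all_cols, _ = dedupe_rows(rows)  # no subset: all three rows differ

print(json.dumps(report))
```

With subset=["id"], keep="first" retains the "Alice" row and keep="last" retains the "Alicia" row. With no subset, all three rows are kept because their "name" values differ.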