csvsmith package

csvsmith: small, focused CSV utilities.

Public API:
  • count_duplicates_sorted
  • add_row_digest
  • find_duplicate_rows
  • dedupe_with_report
  • read_csv_rows
  • write_csv_rows
  • CSVClassifier
  • DropRowsBySubstring
  • excel_to_csv
  • move_by_suffix
  • StringDistance
  • Relation
  • Result
  • analyze_pair
  • strict_concat_rows
  • save_csv

Compatibility aliases:
  • CSVCleaner

Submodules:
  • csvsmith.clean_numeric
  • csvsmith.string_distance
  • csvsmith.row_dedup
  • csvsmith.classify
  • csvsmith.filter_rows
  • csvsmith.excel2csv
  • csvsmith.move_files
  • csvsmith.strict_concat
  • csvsmith.cli (CLI entrypoint)

class csvsmith.CSVClassifier(source_dir: str | Path, dest_dir: str | Path, signatures: dict[str, list[str]] | None = None, *, mode: str = 'strict', match: str = 'exact', auto: bool = False, dry_run: bool = False, report_only: bool = False, encoding: str = 'utf-8-sig', strip: bool = True, casefold: bool = False, drop_empty: bool = True)

Bases: object

Classifies CSV files into folders based on header signatures.

Two orthogonal controls:
  • mode: “strict” | “relaxed”

  • match: “exact” | “contains” (“contains” is the legacy behavior)

signatures:
dict[str, list[str]]
  • category -> expected columns

  • interpretation depends on match:

    exact: expected columns must match the file header exactly
    contains: expected columns must be a subset of the file header
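
The two match settings can be read as set comparisons between a file header and a signature. The sketch below is a minimal illustration, not csvsmith's implementation: the helper name header_matches is hypothetical, and treating “exact” as order-insensitive set equality is an assumption.

```python
def header_matches(header: list[str], expected: list[str], match: str = "exact") -> bool:
    """Hypothetical helper: compare a file header against one signature."""
    header_set, expected_set = set(header), set(expected)
    if match == "exact":
        # Assumption: column order is ignored; only the set of names matters.
        return header_set == expected_set
    if match == "contains":
        # Every expected column must be present; extra columns are allowed.
        return expected_set <= header_set
    raise ValueError(f"unknown match mode: {match!r}")


signatures = {"orders": ["order_id", "amount"]}
header = ["order_id", "amount", "note"]
print(header_matches(header, signatures["orders"], "exact"))     # False
print(header_matches(header, signatures["orders"], "contains"))  # True
```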

rollback(manifest_path: str | Path) → None
run() → None
class csvsmith.DropRowsBySubstring(csv_path: Path | str, column_name: str, unwanted_text: str, *, case_sensitive: bool = True, keep_header: bool = True)

Bases: object

Filter CSV rows by removing rows whose selected column contains a target substring.
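
The filtering idea can be sketched with the standard csv module. This is an illustration under stated assumptions, not the class's actual code: iter_kept_rows_sketch is a hypothetical name, it reads from a string instead of a file path, and it always yields the header first.

```python
import csv
import io
from collections.abc import Iterator


def iter_kept_rows_sketch(
    text: str, column_name: str, unwanted_text: str, case_sensitive: bool = True
) -> Iterator[list[str]]:
    """Yield the header, then every row whose column lacks the substring."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    col = header.index(column_name)
    needle = unwanted_text if case_sensitive else unwanted_text.casefold()
    yield header
    for row in reader:
        cell = row[col] if case_sensitive else row[col].casefold()
        if needle not in cell:
            yield row


data = "name,status\nalice,active\nbob,DELETED user\ncarol,deleted\n"
for kept in iter_kept_rows_sketch(data, "status", "deleted", case_sensitive=False):
    print(kept)
```

With case_sensitive=False only the header and the alice row survive; with the default case-sensitive comparison the bob row is kept as well, because “DELETED” does not contain “deleted”.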

FILTERED_SUFFIX = '.filtered.csv'
iter_kept_rows() → Generator[list[str], None, None]
write_filtered_rows() → None
class csvsmith.Relation(*values)

Bases: Enum

CASE_INSENSITIVE_MATCH = 2
EXACT_MATCH = 1
NORMALIZED_SPACE_MATCH = 4
NO_STRUCTURAL_MATCH = 5
WHITESPACE_TRIMMED_MATCH = 3
class csvsmith.Result(classification: Relation, damerau_levenshtein_distance: int, jaro_winkler_score: float, similarity_percentage: float)

Bases: object

classification: Relation
damerau_levenshtein_distance: int
get_relation_string() → str
jaro_winkler_score: float
similarity_percentage: float
class csvsmith.StringDistance

Bases: object

Provides functionality for calculating string distances and relationships between two strings based on various algorithms.

This class includes methods for analyzing string similarities and relationships, including exact matches, case-insensitive comparisons, and whitespace normalization. It also implements Damerau-Levenshtein and Jaro-Winkler distance calculations.

Returns:
  • classification – Possible relationship classification between two strings.
  • damerau_levenshtein_distance – Integer distance calculated using the Damerau-Levenshtein algorithm.
  • jaro_winkler_score – A float score indicating similarity using the Jaro-Winkler algorithm.
  • similarity_percentage – A percentage similarity score between two strings.
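
One way to picture the structural classification is a cascade of increasingly tolerant comparisons. The sketch below is a guess at that idea, not the library's logic: RelationSketch mirrors the documented Relation member values, while the order of the checks and the exact meaning of each level (especially “normalized space”) are assumptions.

```python
from enum import Enum


class RelationSketch(Enum):
    """Local stand-in that mirrors the documented Relation values."""
    EXACT_MATCH = 1
    CASE_INSENSITIVE_MATCH = 2
    WHITESPACE_TRIMMED_MATCH = 3
    NORMALIZED_SPACE_MATCH = 4
    NO_STRUCTURAL_MATCH = 5


def classify_sketch(a: str, b: str) -> RelationSketch:
    """Hypothetical cascade: try the strictest comparison first."""
    if a == b:
        return RelationSketch.EXACT_MATCH
    if a.casefold() == b.casefold():
        return RelationSketch.CASE_INSENSITIVE_MATCH
    if a.strip() == b.strip():
        return RelationSketch.WHITESPACE_TRIMMED_MATCH
    # Assumption: "normalized space" means runs of whitespace collapse to one space.
    if " ".join(a.split()) == " ".join(b.split()):
        return RelationSketch.NORMALIZED_SPACE_MATCH
    return RelationSketch.NO_STRUCTURAL_MATCH


print(classify_sketch("Total", "total").name)     # CASE_INSENSITIVE_MATCH
print(classify_sketch("unit  price", "unit price").name)  # NORMALIZED_SPACE_MATCH
```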

static analyze(a: str, b: str, ignore_case: bool = False) → Result
static calculate_damerau_levenshtein(s1: str, s2: str) → int

Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
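
The restricted variant is also known as optimal string alignment (OSA) distance. The following is a textbook dynamic-programming sketch of that algorithm, offered as an illustration; the function name osa_distance is hypothetical and the library's own code may differ in detail.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of s2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: the last two characters are swapped.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))           # 1
print(osa_distance("kitten", "sitting"))  # 3
print(osa_distance("ca", "abc"))          # 3
```

The last case shows the restriction: the unrestricted Damerau-Levenshtein distance between "ca" and "abc" is 2, but OSA never edits a substring twice, so it reports 3.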

static calculate_jaro_winkler(s1: str, s2: str) → float
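
For orientation, here is a sketch of the standard Jaro-Winkler similarity. It assumes the conventional parameters (prefix scaling factor 0.1, common prefix capped at 4 characters); whether csvsmith uses the same constants is not stated, and the function names are hypothetical.

```python
def jaro_sketch(s1: str, s2: str) -> float:
    """Standard Jaro similarity in the range [0.0, 1.0]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i in range(len1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    half_transpositions, k = 0, 0
    for i in range(len1):
        if not matched1[i]:
            continue
        while not matched2[k]:
            k += 1
        if s1[i] != s2[k]:
            half_transpositions += 1
        k += 1
    t = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler_sketch(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro_sketch(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


print(round(jaro_winkler_sketch("MARTHA", "MARHTA"), 4))  # 0.9611
```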
static classify(a: str, b: str) → Relation
static equals_ignore_case(a: str, b: str) → bool
static strip_all(s: str) → str

Remove all whitespace characters from a string.

Uses split/join logic so any whitespace character acts as a separator, including spaces, tabs, and newlines.
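
The split/join idiom described above fits in one line. This sketch uses a hypothetical name, strip_all_sketch, and relies only on the behavior of str.split() with no arguments, which splits on any run of whitespace.

```python
def strip_all_sketch(s: str) -> str:
    """Remove every whitespace character by splitting and rejoining."""
    return "".join(s.split())


print(strip_all_sketch("  a\tb\n c  "))  # abc
```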

static trim(s: str) → str
csvsmith.add_row_digest(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, colname: str = 'row_digest', inplace: bool = False) → list[dict[str, object]]

Add a row digest column and return the resulting rows.

csvsmith.analyze_pair(a: str, b: str, ignore_case: bool = False) → Result
csvsmith.clean_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) → float | Any

Cleans and converts a given input to a float by normalizing its numeric representation.

csvsmith.count_duplicates_sorted(items: Iterable[Hashable], threshold: int = 2, reverse: bool = True) → list[tuple[Hashable, int]]

Count items and return those occurring at least threshold times.
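
The counting behavior can be approximated with collections.Counter. In this sketch the result is sorted by count, which is an assumption about what the “sorted” in the name refers to; count_duplicates_sorted_sketch is a hypothetical stand-in.

```python
from collections import Counter
from collections.abc import Hashable, Iterable


def count_duplicates_sorted_sketch(
    items: Iterable[Hashable], threshold: int = 2, reverse: bool = True
) -> list[tuple[Hashable, int]]:
    """Return (item, count) pairs with count >= threshold, ordered by count."""
    counts = Counter(items)
    kept = [(item, n) for item, n in counts.items() if n >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=reverse)


print(count_duplicates_sorted_sketch(["a", "b", "a", "c", "b", "a"]))
# [('a', 3), ('b', 2)]
```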

csvsmith.dedupe_with_report(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', digest_col: str = 'row_digest') → tuple[list[dict[str, object]], list[dict[str, object]]]

Drop duplicates and return (deduped_rows, report).

csvsmith.excel_to_csv(excel_path: str | Path, csv_path: str | Path | None = None, *, sheet_name: str | None = None) → Path

Convert one Excel worksheet into a CSV file.

csvsmith.find_duplicate_rows(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None) → list[dict[str, object]]

Return only rows that participate in duplicate groups.

csvsmith.find_matches_in_csv(file_path, target_key, stringency=1.0, **kwargs)

Scans a CSV for a key and returns coordinates and neighbor data.

csvsmith.move_by_suffix(src_dir: Path | str, dst_dir: Path | str, suffixes: Iterable[str] = {'.csv', '.pdf'}) → int

Move files from src_dir to dst_dir when their suffix matches.

Suffix matching is case-insensitive and accepts values with or without a leading dot (for example, "csv" and ".csv" are treated the same).

Parameters:
  • src_dir – Source directory to scan for files.

  • dst_dir – Destination directory where matching files are moved.

  • suffixes – File suffixes to match against.

Returns:

The number of files moved.
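
The suffix rules described above (case-insensitive, leading dot optional) can be sketched with pathlib and shutil. The helper names are hypothetical, and two details are assumptions: the scan is non-recursive, and the destination directory is created if missing.

```python
import shutil
import tempfile
from collections.abc import Iterable
from pathlib import Path


def normalize_suffixes(suffixes: Iterable[str]) -> set[str]:
    """Lowercase each suffix and make sure it has exactly one leading dot."""
    return {"." + s.lower().lstrip(".") for s in suffixes}


def move_by_suffix_sketch(src_dir, dst_dir, suffixes=(".csv", ".pdf")) -> int:
    """Move matching files from src_dir to dst_dir; return how many moved."""
    src, dst = Path(src_dir), Path(dst_dir)
    wanted = normalize_suffixes(suffixes)
    dst.mkdir(parents=True, exist_ok=True)  # assumption: create if missing
    moved = 0
    for path in sorted(src.iterdir()):      # assumption: top level only
        if path.is_file() and path.suffix.lower() in wanted:
            shutil.move(str(path), str(dst / path.name))
            moved += 1
    return moved


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ("report.CSV", "notes.txt", "scan.pdf"):
        Path(src, name).write_text("x", encoding="utf-8")
    print(move_by_suffix_sketch(src, dst, suffixes=["csv", ".PDF"]))  # 2
```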

csvsmith.read_csv_rows(csv_path: Path | str, encoding: str = 'utf-8-sig') → list[dict[str, Any]]

Read a CSV file into a list of row dictionaries.

csvsmith.save_csv(rows: Iterable[list[str]], out_path: Path | str) → None

Write rows to out_path.

csvsmith.strict_concat_rows(csv_dir: Path | str) → list[list[str]]

Concatenates rows from multiple CSV files into a list of lists of strings, ensuring the headers across all CSV files match. The output includes a new column indicating the file stem.

Parameters:

csv_dir (Path | str) – Directory containing the CSV files or a specific path to a CSV file.

Returns:

A list of lists, where each inner list represents a row from the concatenated CSV files. The first row contains the headers, including a “file_stem” column.

Return type:

list[list[str]]

Raises:

FileNotFoundError – If no CSV files are found in the provided directory.
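
The strict concatenation described above can be sketched as follows. This is an approximation for illustration: strict_concat_rows_sketch is a hypothetical name, it handles the directory case only, it appends “file_stem” as the last column, and it raises ValueError on a header mismatch. Those last three choices are assumptions, since the reference does not specify them.

```python
import csv
import tempfile
from pathlib import Path


def strict_concat_rows_sketch(csv_dir) -> list[list[str]]:
    """Concatenate all *.csv files in csv_dir, requiring identical headers."""
    paths = sorted(Path(csv_dir).glob("*.csv"))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {csv_dir}")
    combined: list[list[str]] = []
    expected_header = None
    for path in paths:
        with path.open(newline="", encoding="utf-8-sig") as handle:
            rows = list(csv.reader(handle))
        if not rows:
            continue  # skip empty files
        header, body = rows[0], rows[1:]
        if expected_header is None:
            expected_header = header
            combined.append(header + ["file_stem"])
        elif header != expected_header:
            # Assumption: the exception type for a mismatch is not documented.
            raise ValueError(f"header mismatch in {path.name}")
        combined.extend(row + [path.stem] for row in body)
    return combined


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "jan.csv").write_text("id,amount\n1,10\n", encoding="utf-8")
    Path(tmp, "feb.csv").write_text("id,amount\n2,20\n", encoding="utf-8")
    for row in strict_concat_rows_sketch(tmp):
        print(row)
```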

csvsmith.write_csv_rows(csv_path: Path | str, rows: Sequence[Mapping[str, object]], *, fieldnames: Sequence[str], encoding: str = 'utf-8-sig') → None

Write row dictionaries to a CSV file.
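
A round trip through write_csv_rows and read_csv_rows presumably resembles the standard csv.DictWriter and csv.DictReader pattern. The sketch below uses hypothetical stand-in functions built on the standard library and does not call csvsmith itself; it shows why the 'utf-8-sig' default matters, since that codec writes a byte order mark and strips it again on read.

```python
import csv
import tempfile
from pathlib import Path


def write_rows_sketch(csv_path, rows, fieldnames, encoding="utf-8-sig") -> None:
    """Write dictionaries as CSV with a header row."""
    with Path(csv_path).open("w", newline="", encoding=encoding) as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_rows_sketch(csv_path, encoding="utf-8-sig") -> list[dict[str, str]]:
    """Read a CSV file into a list of dictionaries keyed by header name."""
    with Path(csv_path).open(newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "people.csv")
    write_rows_sketch(target, [{"id": 1, "name": "alice"}], fieldnames=["id", "name"])
    print(read_rows_sketch(target))  # [{'id': '1', 'name': 'alice'}]
```

Note that values come back as strings: the integer 1 written above is read as '1'.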