csvsmith package

csvsmith: small, focused CSV utilities.

Public API:

- count_duplicates_sorted
- add_row_digest
- find_duplicate_rows
- dedupe_with_report
- read_csv_rows
- write_csv_rows
- CSVClassifier
- DropRowsBySubstring
- excel_to_csv
- move_by_suffix
- StringDistance
- Relation
- Result
- analyze_pair
- strict_concat_rows
- save_csv
- concentrate_csv
- rehydrate_csv

Compatibility aliases:

- CSVCleaner

Submodules:

- csvsmith.clean_numeric
- csvsmith.string_distance
- csvsmith.row_dedup
- csvsmith.classify
- csvsmith.filter_rows
- csvsmith.excel2csv
- csvsmith.move_files
- csvsmith.strict_concat
- csvsmith.cli (CLI entrypoint)

class csvsmith.CSVClassifier(source_dir: str | Path, dest_dir: str | Path, signatures: dict[str, list[str]] | None = None, *, mode: str = 'strict', match: str = 'exact', auto: bool = False, dry_run: bool = False, report_only: bool = False, encoding: str = 'utf-8-sig', strip: bool = True, casefold: bool = False, drop_empty: bool = True)

Bases: object

Classifies CSV files into folders based on header signatures.

Two orthogonal controls:

mode: “strict” | “relaxed”
match: “exact” | “contains” (“contains” is the legacy behavior)

signatures:

dict[str, list[str]]

category -> expected columns
interpretation depends on match:
exact: expected columns must match the file header exactly
contains: expected columns must be a subset of the file header
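The difference between the two match settings can be sketched without the library. The helper below is hypothetical and not part of csvsmith; it assumes that "exact" compares column names in order and that the first matching category wins, neither of which is confirmed by this page.

```python
from __future__ import annotations


def match_category(
    header: list[str],
    signatures: dict[str, list[str]],
    match: str = "exact",
) -> str | None:
    """Return the first category whose expected columns fit the header.

    Hypothetical helper for illustration only; not csvsmith's implementation.
    """
    for category, expected in signatures.items():
        # Assumption: "exact" means the same names in the same order.
        if match == "exact" and list(header) == list(expected):
            return category
        # "contains": every expected column appears somewhere in the header.
        if match == "contains" and set(expected) <= set(header):
            return category
    return None


signatures = {"orders": ["id", "total"], "users": ["id", "email"]}

exact_hit = match_category(["id", "total"], signatures, match="exact")
exact_miss = match_category(["id", "total", "note"], signatures, match="exact")
contains_hit = match_category(["id", "total", "note"], signatures, match="contains")
```

With an extra `note` column, the header no longer matches under "exact" but still matches "orders" under "contains".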

rollback(manifest_path: str | Path) → None

run() → None

class csvsmith.ConcentrateResult(row_count: int, transformed_cell_count: int, mapped_value_count: int, output_csv_path: Path, output_map_path: Path)

Bases: object

Summary of a completed CSV concentration operation.

mapped_value_count: int

output_csv_path: Path

output_map_path: Path

row_count: int

transformed_cell_count: int

class csvsmith.DropRowsBySubstring(csv_path: Path | str, column_name: str, unwanted_text: str, *, case_sensitive: bool = True, keep_header: bool = True)

Bases: object

Filter CSV rows by removing rows whose selected column contains a target substring.

FILTERED_SUFFIX = '.filtered.csv'

iter_kept_rows() → Generator[list[str], None, None]

write_filtered_rows() → None
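The filtering idea behind this class can be illustrated with the standard library alone. This is a minimal sketch, not the class's code: `iter_kept_rows_sketch` is a hypothetical function that reads from a string rather than a file, always yields the header, and assumes case-insensitive matching is done with `casefold()`.

```python
import csv
import io
from typing import Iterator


def iter_kept_rows_sketch(
    csv_text: str,
    column_name: str,
    unwanted_text: str,
    case_sensitive: bool = True,
) -> Iterator[list[str]]:
    """Yield the header, then every row whose selected cell lacks the substring."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    col = header.index(column_name)  # raises ValueError if the column is missing
    yield header
    needle = unwanted_text if case_sensitive else unwanted_text.casefold()
    for row in reader:
        cell = row[col] if case_sensitive else row[col].casefold()
        if needle not in cell:
            yield row


sample = "name,status\nalice,active\nbob,INACTIVE (old)\ncarol,active\n"
kept_insensitive = list(
    iter_kept_rows_sketch(sample, "status", "inactive", case_sensitive=False)
)
kept_sensitive = list(
    iter_kept_rows_sketch(sample, "status", "inactive", case_sensitive=True)
)
```

In the case-sensitive run, the row for `bob` is kept because `INACTIVE` does not contain the lowercase substring.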

class csvsmith.RehydrateResult(row_count: int, restored_cell_count: int, output_csv_path: Path)

Bases: object

Summary of a completed CSV rehydration operation.

output_csv_path: Path

restored_cell_count: int

row_count: int

class csvsmith.Relation(*values)

Bases: Enum

CASE_INSENSITIVE_MATCH = 2

EXACT_MATCH = 1

NORMALIZED_SPACE_MATCH = 4

NO_STRUCTURAL_MATCH = 5

WHITESPACE_TRIMMED_MATCH = 3

class csvsmith.Result(classification: 'Relation', damerau_levenshtein_distance: 'int', jaro_winkler_score: 'float', similarity_percentage: 'float')

Bases: object

classification: Relation

damerau_levenshtein_distance: int

get_relation_string() → str

jaro_winkler_score: float

similarity_percentage: float

class csvsmith.StringDistance

Bases: object

Provides functionality for calculating string distances and relationships between two strings based on various algorithms.

This class includes methods for analyzing string similarities and relationships, including exact matches, case-insensitive comparisons, and whitespace normalization. It also implements Damerau-Levenshtein and Jaro-Winkler distance calculations.

Returns:

classification – Possible relationship classification between two strings.
damerau_levenshtein_distance – Integer distance calculated using the Damerau-Levenshtein algorithm.
jaro_winkler_score – A float score indicating similarity using the Jaro-Winkler algorithm.
similarity_percentage – A percentage similarity score between two strings.

static analyze(a: str, b: str, ignore_case: bool = False) → Result

static calculate_damerau_levenshtein(s1: str, s2: str) → int

Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
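For reference, the restricted variant (also called optimal string alignment) can be written as a standard dynamic-programming table. The function below is a textbook sketch under the hypothetical name `osa_distance`; it is not taken from csvsmith's source and may differ from it in detail.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] = distance between the first i chars of s1 and first j chars of s2
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: the last two characters are swapped.
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


swap = osa_distance("ab", "ba")
classic = osa_distance("kitten", "sitting")
restricted = osa_distance("ca", "abc")
```

"Restricted" means no substring is edited more than once, which is why `"ca"` to `"abc"` costs 3 here, whereas the unrestricted Damerau-Levenshtein distance is 2.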

static calculate_jaro_winkler(s1: str, s2: str) → float

static classify(a: str, b: str) → Relation

static equals_ignore_case(a: str, b: str) → bool

static strip_all(s: str) → str

Remove all whitespace characters from a string.

Uses split/join logic so any whitespace character acts as a separator, including spaces, tabs, and newlines.
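The split/join idiom described above looks like this in plain Python. The name `strip_all_sketch` is hypothetical and only mirrors the documented behavior.

```python
def strip_all_sketch(s: str) -> str:
    """Remove every whitespace character by splitting on whitespace and rejoining."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and discards empty pieces.
    return "".join(s.split())


compact = strip_all_sketch("  a\tb\n c  ")
```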

static trim(s: str) → str

csvsmith.add_row_digest(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, colname: str = 'row_digest', inplace: bool = False) → list[dict[str, object]]

Add a row digest column and return the resulting rows.

csvsmith.analyze_pair(a: str, b: str, ignore_case: bool = False) → Result

csvsmith.clean_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) → float | Any

Cleans and converts a given input to a float by normalizing its numeric representation.

csvsmith.concentrate_csv(input_path: Path | str, output_csv_path: Path | str, output_map_path: Path | str, *, columns: Sequence[str] | None = None, min_occurrences: int = 2, encoding: str = 'utf-8-sig') → ConcentrateResult

Replace repeated values in selected CSV columns with deterministic tokens.

The output map is versioned and records the source header and transformed column indices so that rehydration can validate its input.
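The general idea of tokenizing repeated values, and reversing it later, can be sketched on in-memory rows. Everything here is an assumption made for illustration: the helper names, the `@N` token format, and the first-appearance numbering are hypothetical, and the versioned map file with header validation described above is not modelled.

```python
from collections import Counter


def concentrate_rows(rows, columns, min_occurrences=2):
    """Replace values seen at least min_occurrences times with '@N' tokens."""
    header, body = rows[0], rows[1:]
    idx = [header.index(name) for name in columns]
    counts = Counter(row[i] for row in body for i in idx)
    token_for = {}
    for row in body:
        for i in idx:
            value = row[i]
            # Tokens are numbered in first-appearance order, so the same
            # input always produces the same map.
            if counts[value] >= min_occurrences and value not in token_for:
                token_for[value] = f"@{len(token_for)}"
    dense = [list(header)]
    for row in body:
        new_row = list(row)
        for i in idx:
            new_row[i] = token_for.get(row[i], row[i])
        dense.append(new_row)
    return dense, token_for


def rehydrate_rows(rows, columns, token_for):
    """Restore original values; assumes no raw cell already looks like a token."""
    value_for = {token: value for value, token in token_for.items()}
    header = rows[0]
    idx = [header.index(name) for name in columns]
    restored = [list(header)]
    for row in rows[1:]:
        new_row = list(row)
        for i in idx:
            new_row[i] = value_for.get(row[i], row[i])
        restored.append(new_row)
    return restored


original = [
    ["id", "city"],
    ["1", "Springfield"],
    ["2", "Shelbyville"],
    ["3", "Springfield"],
]
dense, mapping = concentrate_rows(original, ["city"])
round_trip = rehydrate_rows(dense, ["city"], mapping)
```

A real implementation would also need to guard against raw values that collide with the token format, which is presumably one reason the library validates its map during rehydration.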

csvsmith.count_duplicates_sorted(items: Iterable[Hashable], threshold: int = 2, reverse: bool = True) → list[tuple[Hashable, int]]

Count items and return those occurring at least threshold times.
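A comparable result can be produced with `collections.Counter`. This sketch uses a hypothetical function name and assumes the result is sorted by count, with `reverse=True` putting the most frequent items first; the page does not state the sort key.

```python
from collections import Counter


def count_duplicates_sketch(items, threshold=2, reverse=True):
    """Return (item, count) pairs for items seen at least `threshold` times."""
    counts = Counter(items)
    kept = [(item, n) for item, n in counts.items() if n >= threshold]
    # Assumption: ordering is by count; ties keep first-seen order.
    return sorted(kept, key=lambda pair: pair[1], reverse=reverse)


descending = count_duplicates_sketch(["a", "b", "a", "c", "b", "a"])
ascending = count_duplicates_sketch(["a", "b", "a", "c", "b", "a"], reverse=False)
```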

csvsmith.dedupe_with_report(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', digest_col: str = 'row_digest') → tuple[list[dict[str, object]], list[dict[str, object]]]

Drop duplicates and return (deduped_rows, report).

csvsmith.excel_to_csv(excel_path: str | Path, csv_path: str | Path | None = None, *, sheet_name: str | None = None) → Path

Convert one Excel worksheet into a CSV file.

csvsmith.find_duplicate_rows(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None) → list[dict[str, object]]

Return only rows that participate in duplicate groups.

csvsmith.find_matches_in_csv(file_path, target_key, stringency=1.0, **kwargs)

Scans a CSV for a key and returns coordinates and neighbor data.

csvsmith.move_by_suffix(src_dir: Path | str, dst_dir: Path | str, suffixes: Iterable[str] = {'.csv', '.pdf'}) → int

Move files from src_dir to dst_dir when their suffix matches.

Suffix matching is case-insensitive and accepts values with or without a leading dot (for example, "csv" and ".csv" are treated the same).

Parameters:

src_dir – Source directory to scan for files.
dst_dir – Destination directory where matching files are moved.
suffixes – File suffixes to match against.

Returns:

The number of files moved.
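The suffix rule described above (case-insensitive, leading dot optional) can be sketched as a small normalization step. Both helper names are hypothetical; the sketch only tests names and does not move any files.

```python
from pathlib import Path


def normalize_suffixes(suffixes):
    """Lowercase each suffix and make sure it starts with a dot."""
    normalized = set()
    for suffix in suffixes:
        suffix = suffix.strip().lower()
        if not suffix:
            continue  # ignore empty entries
        normalized.add(suffix if suffix.startswith(".") else "." + suffix)
    return normalized


def suffix_matches(path, suffixes):
    """True when the file's final suffix is in the normalized set."""
    return Path(path).suffix.lower() in normalize_suffixes(suffixes)


wanted = normalize_suffixes(["csv", ".CSV", "Pdf"])
upper_hit = suffix_matches("report.CSV", ["csv"])
miss = suffix_matches("notes.txt", ["csv", "pdf"])
```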

csvsmith.read_csv_rows(csv_path: Path | str, encoding: str = 'utf-8-sig') → list[dict[str, Any]]

Read a CSV file into a list of row dictionaries.

csvsmith.rehydrate_csv(input_csv_path: Path | str, map_path: Path | str, output_csv_path: Path | str, *, encoding: str = 'utf-8-sig') → RehydrateResult

Restore tokenized cells using a validated dense CSV map.

csvsmith.save_csv(rows: Iterable[list[str]], out_path: Path | str) → None

Write rows to out_path.

csvsmith.strict_concat_rows(csv_dir: Path | str) → list[list[str]]

Concatenates rows from multiple CSV files into a list of lists of strings, ensuring the headers across all CSV files match. The output includes a new column indicating the file stem.

Parameters:

csv_dir (Path | str) – Directory containing the CSV files or a specific path to a CSV file.

Returns:

A list of lists, where each inner list represents a row from the concatenated CSV files. The first row contains the headers, including a “file_stem” column.

Return type:

list[list[str]]

Raises:

FileNotFoundError – If no CSV files are found in the provided directory.
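A strict concatenation of this kind can be sketched with the standard `csv` module. Several details are assumptions rather than documented behavior: the hypothetical `strict_concat_sketch` handles directories only, puts `file_stem` in the first column, processes files in sorted name order, and raises `ValueError` on a header mismatch.

```python
import csv
import tempfile
from pathlib import Path


def strict_concat_sketch(csv_dir):
    """Concatenate all *.csv files in a directory, requiring identical headers."""
    paths = sorted(Path(csv_dir).glob("*.csv"))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {csv_dir}")
    combined = []
    expected = None
    for path in paths:
        with path.open(newline="", encoding="utf-8-sig") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            if expected is None:
                expected = header
                # Assumption: the file stem goes in the first column.
                combined.append(["file_stem", *header])
            elif header != expected:
                # Assumption: the error type for mismatched headers.
                raise ValueError(f"header mismatch in {path.name}")
            combined.extend([path.stem, *row] for row in reader)
    return combined


# Small demonstration with two temporary files that share a header.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "jan.csv").write_text("id,amount\n1,10\n", encoding="utf-8")
    Path(tmp, "feb.csv").write_text("id,amount\n2,20\n", encoding="utf-8")
    combined = strict_concat_sketch(tmp)
```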

csvsmith.write_csv_rows(csv_path: Path | str, rows: Sequence[Mapping[str, object]], *, fieldnames: Sequence[str], encoding: str = 'utf-8-sig') → None

Write row dictionaries to a CSV file.