csvsmith package
csvsmith: small, focused CSV utilities.
Public API:
- count_duplicates_sorted
- add_row_digest
- find_duplicate_rows
- dedupe_with_report
- read_csv_rows
- write_csv_rows
- CSVClassifier
- DropRowsBySubstring
- excel_to_csv
- move_by_suffix
- StringDistance
- Relation
- Result
- analyze_pair
- strict_concat_rows
- save_csv

Compatibility aliases:
- CSVCleaner

Submodules:
- csvsmith.clean_numeric
- csvsmith.string_distance
- csvsmith.row_dedup
- csvsmith.classify
- csvsmith.filter_rows
- csvsmith.excel2csv
- csvsmith.move_files
- csvsmith.strict_concat
- csvsmith.cli (CLI entrypoint)
- class csvsmith.CSVClassifier(source_dir: str | Path, dest_dir: str | Path, signatures: dict[str, list[str]] | None = None, *, mode: str = 'strict', match: str = 'exact', auto: bool = False, dry_run: bool = False, report_only: bool = False, encoding: str = 'utf-8-sig', strip: bool = True, casefold: bool = False, drop_empty: bool = True)
Bases: object
Classifies CSV files into folders based on header signatures.
Two orthogonal controls:
- mode: "strict" | "relaxed"
- match: "exact" | "contains" ("contains" is the legacy behavior)
signatures:
- dict[str, list[str]], mapping each category to its expected columns.
- Interpretation depends on match:
  - exact: the expected columns must match the file header exactly.
  - contains: the expected columns must be a subset of the file header.
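The difference between the two match rules can be sketched in a few lines. This is an illustrative sketch, not csvsmith's implementation: the helper names `header_matches` and `classify_header` are hypothetical, and treating "exact" as an order-insensitive set comparison is an assumption.

```python
from __future__ import annotations


def header_matches(header: list[str], expected: list[str], match: str = "exact") -> bool:
    """Check a file header against one signature (hypothetical helper)."""
    header_set, expected_set = set(header), set(expected)
    if match == "exact":
        # Assumption: column order is ignored; only the set of names must be equal.
        return header_set == expected_set
    if match == "contains":
        # Every expected column must appear in the header; extras are allowed.
        return expected_set <= header_set
    raise ValueError(f"unknown match rule: {match!r}")


def classify_header(header: list[str], signatures: dict[str, list[str]], match: str = "exact") -> str | None:
    """Return the first category whose signature matches, or None."""
    for category, expected in signatures.items():
        if header_matches(header, expected, match):
            return category
    return None


signatures = {"sales": ["date", "amount"], "users": ["id", "email"]}
print(classify_header(["date", "amount", "note"], signatures, match="contains"))
```

Under "exact", the extra `note` column would leave the same header unclassified.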
- class csvsmith.DropRowsBySubstring(csv_path: Path | str, column_name: str, unwanted_text: str, *, case_sensitive: bool = True, keep_header: bool = True)
Bases: object
Filter CSV rows by removing rows whose selected column contains a target substring.
- FILTERED_SUFFIX = '.filtered.csv'
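The filtering idea can be sketched with the standard `csv` module. This is a minimal sketch, not the class itself: the function `drop_rows_by_substring` is hypothetical, works on in-memory text rather than a file path, and assumes a missing cell counts as an empty string.

```python
import csv
import io


def drop_rows_by_substring(csv_text, column_name, unwanted_text, case_sensitive=True):
    """Return the rows whose `column_name` cell does NOT contain `unwanted_text`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    needle = unwanted_text if case_sensitive else unwanted_text.casefold()
    kept = []
    for row in reader:
        cell = row.get(column_name) or ""  # assumption: missing cell behaves like ""
        haystack = cell if case_sensitive else cell.casefold()
        if needle not in haystack:
            kept.append(row)
    return kept


text = "name,status\nalice,active\nbob,Inactive test\ncarol,test\n"
for row in drop_rows_by_substring(text, "status", "test"):
    print(row["name"])
```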
- class csvsmith.Relation(*values)
Bases: Enum
- CASE_INSENSITIVE_MATCH = 2
- EXACT_MATCH = 1
- NORMALIZED_SPACE_MATCH = 4
- NO_STRUCTURAL_MATCH = 5
- WHITESPACE_TRIMMED_MATCH = 3
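The member names and values suggest a ladder of progressively looser comparisons. The sketch below shows one way such a ladder could work. It defines a local copy of the enum so that it runs standalone, and the function `classify_relation`, the order of the checks, and the exact normalization at each step are all assumptions rather than documented behavior.

```python
from enum import Enum


class Relation(Enum):
    # Local copy mirroring the documented member values.
    EXACT_MATCH = 1
    CASE_INSENSITIVE_MATCH = 2
    WHITESPACE_TRIMMED_MATCH = 3
    NORMALIZED_SPACE_MATCH = 4
    NO_STRUCTURAL_MATCH = 5


def classify_relation(a: str, b: str) -> Relation:
    """Return the strictest relation that holds (hypothetical helper)."""
    if a == b:
        return Relation.EXACT_MATCH
    if a.casefold() == b.casefold():
        return Relation.CASE_INSENSITIVE_MATCH
    if a.strip() == b.strip():
        return Relation.WHITESPACE_TRIMMED_MATCH
    # Collapse every run of whitespace to a single space before comparing.
    if " ".join(a.split()) == " ".join(b.split()):
        return Relation.NORMALIZED_SPACE_MATCH
    return Relation.NO_STRUCTURAL_MATCH


print(classify_relation("New  York", "New York").name)
```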
- class csvsmith.Result(classification: 'Relation', damerau_levenshtein_distance: 'int', jaro_winkler_score: 'float', similarity_percentage: 'float')
Bases: object
- damerau_levenshtein_distance: int
- jaro_winkler_score: float
- similarity_percentage: float
- class csvsmith.StringDistance
Bases: object
Provides functionality for calculating string distances and relationships between two strings based on various algorithms.
This class includes methods for analyzing string similarities and relationships, including exact matches, case-insensitive comparisons, and whitespace normalization. It also implements Damerau-Levenshtein and Jaro-Winkler distance calculations.
- Return classification:
Possible relationship classification between two strings.
- Return damerau_levenshtein_distance:
Integer distance calculated using the Damerau-Levenshtein algorithm.
- Return jaro_winkler_score:
A float score indicating similarity using the Jaro-Winkler algorithm.
- Return similarity_percentage:
A percentage similarity score between two strings.
- static calculate_damerau_levenshtein(s1: str, s2: str) -> int
Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
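For reference, the restricted variant (also called optimal string alignment) can be written as a standard dynamic program. The function below is a textbook sketch under the hypothetical name `osa_distance`, not csvsmith's own code.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of s1
    for j in range(cols):
        d[0][j] = j  # insert all j characters of s2
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: the last two characters are swapped.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))   # 1: a single adjacent transposition
print(osa_distance("ca", "abc"))  # 3: the restricted variant cannot edit a transposed pair again
```

The second call is the classic case that separates the restricted variant from the unrestricted Damerau-Levenshtein distance, which gives 2 for the same pair.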
- csvsmith.add_row_digest(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, colname: str = 'row_digest', inplace: bool = False) -> list[dict[str, object]]
Add a row digest column and return the resulting rows.
- csvsmith.clean_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) -> float | Any
Cleans and converts a given input to a float by normalizing its numeric representation.
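What "normalizing" could involve, given the `sep` and `decimal` parameters, is sketched below. The helper `to_float` is hypothetical: returning the input unchanged when it cannot be parsed is an assumption based on the `float | Any` return annotation, and the `relaxed` flag is not modeled because its behavior is not described here.

```python
def to_float(value, sep=",", decimal="."):
    """Strip thousands separators and normalize the decimal mark (hypothetical helper)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    if not isinstance(value, str):
        return value  # assumption: non-string, non-numeric input passes through
    text = value.strip().replace(sep, "")
    if decimal != ".":
        text = text.replace(decimal, ".")
    try:
        return float(text)
    except ValueError:
        return value  # assumption: unparseable text is returned unchanged


print(to_float("1,234.50"))
print(to_float("1.234,50", sep=".", decimal=","))
```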
- csvsmith.count_duplicates_sorted(items: Iterable[Hashable], threshold: int = 2, reverse: bool = True) -> list[tuple[Hashable, int]]
Count items and return those occurring at least threshold times.
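The described behavior maps naturally onto `collections.Counter`. The sketch below uses a hypothetical name, `count_at_least`, and assumes that `reverse` controls a sort by count (descending when true); tie-breaking in the real function may differ.

```python
from collections import Counter


def count_at_least(items, threshold=2, reverse=True):
    """Return (item, count) pairs for items seen at least `threshold` times."""
    counts = Counter(items)
    kept = [(item, n) for item, n in counts.items() if n >= threshold]
    # Assumption: results are ordered by count; ties keep first-seen order.
    return sorted(kept, key=lambda pair: pair[1], reverse=reverse)


print(count_at_least(["a", "b", "a", "c", "b", "a"]))
```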
- csvsmith.dedupe_with_report(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', digest_col: str = 'row_digest') -> tuple[list[dict[str, object]], list[dict[str, object]]]
Drop duplicates and return (deduped_rows, report).
- csvsmith.excel_to_csv(excel_path: str | Path, csv_path: str | Path | None = None, *, sheet_name: str | None = None) -> Path
Convert one Excel worksheet into a CSV file.
- csvsmith.find_duplicate_rows(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None) -> list[dict[str, object]]
Return only rows that participate in duplicate groups.
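"Participating in a duplicate group" can be illustrated by keying each row on its compared columns and keeping rows whose key occurs more than once. This is a sketch only: the name `duplicate_rows` is hypothetical, cell values are assumed to be hashable, and the real function may build its keys differently (for example, via a digest).

```python
from collections import Counter


def duplicate_rows(rows, subset=None):
    """Return copies of the rows whose key (over `subset` columns) is shared."""

    def key(row):
        columns = sorted(row) if subset is None else subset
        return tuple((col, row.get(col)) for col in columns)

    counts = Counter(key(row) for row in rows)
    # Keep original order; a row qualifies if its key appears at least twice.
    return [dict(row) for row in rows if counts[key(row)] > 1]


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "a"}]
print(duplicate_rows(rows, subset=["name"]))
```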
- csvsmith.find_matches_in_csv(file_path, target_key, stringency=1.0, **kwargs)
Scans a CSV for a key and returns coordinates and neighbor data.
- csvsmith.move_by_suffix(src_dir: Path | str, dst_dir: Path | str, suffixes: Iterable[str] = {'.csv', '.pdf'}) -> int
Move files from src_dir to dst_dir when their suffix matches.
Suffix matching is case-insensitive and accepts values with or without a leading dot (for example, "csv" and ".csv" are treated the same).
- Parameters:
src_dir – Source directory to scan for files.
dst_dir – Destination directory where matching files are moved.
suffixes – File suffixes to match against.
- Returns:
The number of files moved.
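The suffix rule described above (case-insensitive, leading dot optional) can be sketched with `pathlib`. The helpers `normalize_suffixes` and `suffix_matches` are hypothetical names, and the sketch only tests names; it does not move any files.

```python
from pathlib import Path


def normalize_suffixes(suffixes):
    """Lowercase each suffix and make sure it starts with a dot."""
    normalized = set()
    for suffix in suffixes:
        suffix = suffix.strip().lower()
        if not suffix:
            continue  # assumption: empty entries are ignored
        if not suffix.startswith("."):
            suffix = "." + suffix
        normalized.add(suffix)
    return normalized


def suffix_matches(path, suffixes):
    """True if the path's final suffix is in the normalized suffix set."""
    return Path(path).suffix.lower() in normalize_suffixes(suffixes)


print(sorted(normalize_suffixes(["csv", ".CSV", ".pdf"])))
print(suffix_matches("report.CSV", ["csv"]))
```

Note that `Path.suffix` only returns the last component, so `archive.tar.gz` would be compared as `.gz`.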
- csvsmith.read_csv_rows(csv_path: Path | str, encoding: str = 'utf-8-sig') -> list[dict[str, Any]]
Read a CSV file into a list of row dictionaries.
- csvsmith.save_csv(rows: Iterable[list[str]], out_path: Path | str) -> None
Write rows to out_path.
- csvsmith.strict_concat_rows(csv_dir: Path | str) -> list[list[str]]
Concatenates rows from multiple CSV files into a list of lists of strings, ensuring the headers across all CSV files match. The output includes a new column indicating the file stem.
- Parameters:
csv_dir (Path | str) – Directory containing the CSV files or a specific path to a CSV file.
- Returns:
A list of lists, where each inner list represents a row from the concatenated CSV files. The first row contains the headers, including a “file_stem” column.
- Return type:
list[list[str]]
- Raises:
FileNotFoundError – If no CSV files are found in the provided directory.
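The header check and the added column can be sketched with the standard library. This is a sketch under stated assumptions, not the library's code: the function `concat_with_stem` is hypothetical, it handles only the directory case, it appends "file_stem" as the last column, and it raises `ValueError` on a header mismatch (the exception csvsmith raises in that case is not documented here).

```python
import csv
import tempfile
from pathlib import Path


def concat_with_stem(csv_dir):
    """Concatenate all *.csv files in a directory, requiring identical headers."""
    paths = sorted(Path(csv_dir).glob("*.csv"))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {csv_dir}")
    header = None
    out = []
    for path in paths:
        with path.open(newline="", encoding="utf-8-sig") as fh:
            rows = list(csv.reader(fh))
        if not rows:
            continue  # assumption: empty files are skipped
        if header is None:
            header = rows[0]
            out.append(header + ["file_stem"])  # assumption: stem column goes last
        elif rows[0] != header:
            raise ValueError(f"header mismatch in {path.name}")
        out.extend(row + [path.stem] for row in rows[1:])
    return out


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("x,y\n1,2\n", encoding="utf-8")
    Path(tmp, "b.csv").write_text("x,y\n3,4\n", encoding="utf-8")
    for row in concat_with_stem(tmp):
        print(row)
```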