csvsmith.tools
The tools package contains the core logic for csvsmith’s data processing and file management capabilities. Each tool is designed to be usable both as a standalone CLI command and as a reusable Python class or function.
Submodules
- class csvsmith.tools.classify.CSVClassifier(source_dir: str | Path, dest_dir: str | Path, signatures: dict[str, list[str]] | None = None, *, mode: str = 'strict', match: str = 'exact', auto: bool = False, dry_run: bool = False, report_only: bool = False, encoding: str = 'utf-8-sig', strip: bool = True, casefold: bool = False, drop_empty: bool = True)
Bases: object
Classifies CSV files into folders based on header signatures.
- Two orthogonal controls:
mode: "strict" | "relaxed"
match: "exact" | "contains" ("contains" is the legacy behavior)
- signatures:
- dict[str, list[str]]
category -> expected columns
- interpretation depends on match:
exact: expected columns must match the file header exactly
contains: expected columns must be a subset of the file header
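The difference between the two match values can be sketched in plain Python. This is a minimal illustration under stated assumptions, not csvsmith's implementation: the helper name classify_header is hypothetical, and column order is ignored so that only the effect of match is visible.

```python
from typing import Optional


def classify_header(
    header: list,
    signatures: dict,
    match: str = "exact",
) -> Optional[str]:
    """Return the first category whose expected columns fit the header.

    Hypothetical helper: column order is ignored here to keep the focus
    on the difference between "exact" and "contains".
    """
    found = set(header)
    for category, expected in signatures.items():
        wanted = set(expected)
        if match == "exact" and wanted == found:
            return category
        if match == "contains" and wanted <= found:
            return category
    return None


signatures = {"orders": ["order_id", "amount"]}
header = ["order_id", "amount", "note"]

# The extra "note" column defeats "exact" but not "contains".
exact_result = classify_header(header, signatures, match="exact")
contains_result = classify_header(header, signatures, match="contains")
```

With this sample header, the "exact" lookup finds no category, while the "contains" lookup returns "orders".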
- class csvsmith.tools.classify.HeaderKey(mode: str, cols: tuple[str, ...])
Bases: object
Hashable header signature.
mode="strict" -> ordered tuple (col order matters)
mode="relaxed" -> sorted unique tuple (col order does NOT matter)
- cols: tuple[str, ...]
- mode: str
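The two normalizations described for HeaderKey can be sketched as follows. SketchHeaderKey and make_key are hypothetical names used only for illustration. The sketch assumes a frozen dataclass is an acceptable stand-in for a hashable key and does not reproduce the library's own class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchHeaderKey:
    """Hypothetical stand-in for a hashable header signature."""

    mode: str
    cols: tuple


def make_key(header, mode="strict"):
    """Build a key from a header row according to the chosen mode."""
    if mode == "strict":
        cols = tuple(header)  # keep the original column order
    elif mode == "relaxed":
        cols = tuple(sorted(set(header)))  # sorted, duplicates removed
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return SketchHeaderKey(mode=mode, cols=cols)


a = make_key(["id", "name"], mode="strict")
b = make_key(["name", "id"], mode="strict")
c = make_key(["id", "name"], mode="relaxed")
d = make_key(["name", "id"], mode="relaxed")

# Keys are hashable, so they can group files in a dict.
groups = {a: "first", c: "second"}
```

Here a and b differ because the order differs, while c and d compare equal and hash to the same dictionary slot.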
- csvsmith.tools.excel2csv.excel_to_csv(excel_path: str | Path, csv_path: str | Path | None = None, *, sheet_name: str | None = None) -> Path
Convert one Excel worksheet into a CSV file.
- csvsmith.tools.filter_rows.CSVCleaner
alias of DropRowsBySubstring
- class csvsmith.tools.filter_rows.DropRowsBySubstring(csv_path: Path | str, column_name: str, unwanted_text: str, *, case_sensitive: bool = True, keep_header: bool = True)
Bases: object
Filter CSV rows by removing rows whose selected column contains a target substring.
- FILTERED_SUFFIX = '.filtered.csv'
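The filtering idea behind DropRowsBySubstring can be sketched with the standard csv module. The function drop_rows_by_substring is a hypothetical helper. It works on in-memory rows and only illustrates the substring test and the case_sensitive switch, not the class's file handling or its output naming.

```python
import csv
import io


def drop_rows_by_substring(rows, column_name, unwanted_text, case_sensitive=True):
    """Yield rows (dicts) whose column does not contain unwanted_text."""
    needle = unwanted_text if case_sensitive else unwanted_text.casefold()
    for row in rows:
        value = row.get(column_name) or ""
        haystack = value if case_sensitive else value.casefold()
        if needle not in haystack:
            yield row


sample = io.StringIO("name,status\nalpha,ok\nbeta,TEST run\ngamma,test\n")
rows = list(csv.DictReader(sample))

# Case-sensitive: only the lowercase "test" row is dropped.
kept_sensitive = list(drop_rows_by_substring(rows, "status", "test"))

# Case-insensitive: "TEST run" is dropped as well.
kept_insensitive = list(
    drop_rows_by_substring(rows, "status", "test", case_sensitive=False)
)
```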
- csvsmith.tools.filter_rows.main(csv_path: Path | str, column_name: str, unwanted_text: str) -> None
- csvsmith.tools.find_matches_in_csv.find_matches_in_csv(file_path, target_key, stringency=1.0, **kwargs)
Scans a CSV for a key and returns coordinates and neighbor data.
- csvsmith.tools.move_files.move_by_suffix(src_dir: Path | str, dst_dir: Path | str, suffixes: Iterable[str] = {'.csv', '.pdf'}) -> int
Move files from src_dir to dst_dir when their suffix matches.
Suffix matching is case-insensitive and accepts values with or without a leading dot (for example, "csv" and ".csv" are treated the same).
- Parameters:
src_dir – Source directory to scan for files.
dst_dir – Destination directory where matching files are moved.
suffixes – File suffixes to match against.
- Returns:
The number of files moved.
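The suffix rules described above (case-insensitive, leading dot optional) can be sketched with pathlib and shutil. The names normalize_suffixes and move_matching are hypothetical. This is one plausible reading of the documented behavior, not the library's code, and it assumes only the top level of the source directory is scanned.

```python
import shutil
import tempfile
from pathlib import Path


def normalize_suffixes(suffixes):
    """Lowercase each suffix and make sure it starts with a dot."""
    return {"." + s.lower().lstrip(".") for s in suffixes}


def move_matching(src_dir, dst_dir, suffixes=(".csv", ".pdf")):
    """Move files whose suffix matches; return how many were moved."""
    src, dst = Path(src_dir), Path(dst_dir)
    wanted = normalize_suffixes(suffixes)
    dst.mkdir(parents=True, exist_ok=True)
    moved = 0
    for path in sorted(src.iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            shutil.move(str(path), str(dst / path.name))
            moved += 1
    return moved


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in"
    dst = Path(tmp) / "out"
    src.mkdir()
    for name in ("a.csv", "b.CSV", "c.txt"):
        (src / name).write_text("x")
    # "csv" without a dot still matches both ".csv" and ".CSV".
    moved_count = move_matching(src, dst, suffixes=["csv"])
    moved_names = sorted(p.name for p in dst.iterdir())
```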
- csvsmith.tools.row_dedup.add_row_digest(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, colname: str = 'row_digest', inplace: bool = False) -> list[dict[str, object]]
Add a row digest column and return the resulting rows.
- csvsmith.tools.row_dedup.dedupe_csv_file(src: Path | str, dst: Path | str, *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', encoding: str = 'utf-8-sig') -> list[dict[str, object]]
Deduplicate a CSV file, write the result, and return the report.
- csvsmith.tools.row_dedup.dedupe_with_report(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', digest_col: str = 'row_digest') -> tuple[list[dict[str, object]], list[dict[str, object]]]
Drop duplicates and return (deduped_rows, report).
- csvsmith.tools.row_dedup.find_duplicate_rows(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None) -> list[dict[str, object]]
Return only rows that participate in duplicate groups.
- csvsmith.tools.row_dedup.make_row_digest(row: Mapping[str, object], *, columns: Sequence[str]) -> str
Build a SHA-256 digest for a row using selected columns.
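A digest-based duplicate check can be sketched with hashlib. The helper sketch_row_digest is hypothetical, and the way values are serialized before hashing (stringified and joined with a separator) is an assumption. The library may encode rows differently, so digests from this sketch should not be expected to match its output.

```python
import hashlib


def sketch_row_digest(row, columns):
    """Hash the selected column values of one row with SHA-256.

    Assumption: values are stringified and joined with the ASCII unit
    separator so that ("ab", "c") and ("a", "bc") hash differently.
    """
    payload = "\x1f".join(str(row.get(col, "")) for col in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rows = [
    {"id": "1", "name": "Ada", "note": "first"},
    {"id": "1", "name": "Ada", "note": "second"},
    {"id": "2", "name": "Bob", "note": "third"},
]

# Only "id" and "name" take part in the digest; "note" is ignored.
digests = [sketch_row_digest(r, columns=["id", "name"]) for r in rows]

# Keep the first row seen for each digest, in the spirit of keep='first'.
seen = set()
deduped = []
for row, digest in zip(rows, digests):
    if digest not in seen:
        seen.add(digest)
        deduped.append(row)
```

Restricting the digest to a subset of columns is what lets two rows count as duplicates even when an unselected column differs.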
- csvsmith.tools.strict_concat.find_csvs(csv_dir: Path | str) -> list[Path]
Find all CSV files in the specified directory.
This function searches for all files with a .csv extension in the given directory and returns a sorted list of their paths.
- Parameters:
csv_dir (Path | str) – The directory to search for CSV files. This can be provided as either a Path object or a string representing the path to the directory.
- Returns:
Sorted list of paths to all .csv files found in the specified directory.
- Return type:
list[Path]
- csvsmith.tools.strict_concat.read_header(csv_path: Path) -> list[str]
Reads the header row of a given CSV file.
This function opens the CSV file at the specified path, reads its first row, and returns it as a list of strings. The file is read as UTF-8 with an optional BOM (byte order mark), in read mode with newline translation disabled. If the CSV file is empty, a ValueError is raised.
- Parameters:
csv_path (Path) – The path to the CSV file to read the header from.
- Returns:
A list of strings representing the header row of the CSV file.
- Return type:
list[str]
- Raises:
ValueError – If the CSV file is empty and no header can be retrieved.
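The behavior described for read_header (UTF-8 with optional BOM, no newline translation, ValueError on an empty file) can be sketched with the standard library. The name sketch_read_header and the error message are assumptions. This sketch shows the documented contract and is not the library's source.

```python
import csv
import tempfile
from pathlib import Path


def sketch_read_header(csv_path):
    """Return the first row of a CSV file, or raise ValueError if empty."""
    # utf-8-sig strips a leading BOM if present; newline="" lets the csv
    # module handle line endings itself.
    with open(csv_path, "r", encoding="utf-8-sig", newline="") as handle:
        reader = csv.reader(handle)
        try:
            return next(reader)
        except StopIteration:
            raise ValueError(f"{csv_path} is empty; no header found") from None


with tempfile.TemporaryDirectory() as tmp:
    # A file that starts with a UTF-8 BOM and uses CRLF line endings.
    with_bom = Path(tmp) / "with_bom.csv"
    with_bom.write_bytes(b"\xef\xbb\xbfid,name\r\n1,Ada\r\n")
    header = sketch_read_header(with_bom)

    # A zero-byte file has no header row to return.
    empty = Path(tmp) / "empty.csv"
    empty.write_bytes(b"")
    try:
        sketch_read_header(empty)
        empty_raised = False
    except ValueError:
        empty_raised = True
```

Reading with the utf-8-sig codec keeps the BOM out of the first column name, which would otherwise break header comparisons.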
- csvsmith.tools.strict_concat.save_csv(rows: Iterable[list[str]], out_path: Path | str) -> None
Write rows to out_path.
- csvsmith.tools.strict_concat.strict_concat_rows(csv_dir: Path | str) -> list[list[str]]
Concatenates rows from multiple CSV files into a list of lists of strings, ensuring the headers across all CSV files match. The output includes a new column indicating the file stem.
- Parameters:
csv_dir (Path | str) – Directory containing the CSV files or a specific path to a CSV file.
- Returns:
A list of lists, where each inner list represents a row from the concatenated CSV files. The first row contains the headers, including a “file_stem” column.
- Return type:
list[list[str]]
- Raises:
FileNotFoundError – If no CSV files are found in the provided directory.
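A header-checked concatenation of this kind can be sketched as below. The function sketch_strict_concat is hypothetical. Appending "file_stem" as the last column and raising ValueError on a header mismatch are assumptions, since the reference above does not state the column position or the mismatch behavior.

```python
import csv
import tempfile
from pathlib import Path


def sketch_strict_concat(csv_dir):
    """Concatenate CSVs that share one header, tagging rows by file stem."""
    paths = sorted(Path(csv_dir).glob("*.csv"))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {csv_dir}")

    expected = None
    combined = []
    for path in paths:
        with open(path, "r", encoding="utf-8-sig", newline="") as handle:
            rows = list(csv.reader(handle))
        if not rows:
            raise ValueError(f"{path} is empty")
        header, body = rows[0], rows[1:]
        if expected is None:
            # The first file fixes the header every later file must match.
            expected = header
            combined.append(header + ["file_stem"])
        elif header != expected:
            raise ValueError(f"header mismatch in {path.name}")
        combined.extend(row + [path.stem] for row in body)
    return combined


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "jan.csv").write_text("id,amount\n1,10\n", encoding="utf-8")
    (Path(tmp) / "feb.csv").write_text("id,amount\n2,20\n", encoding="utf-8")
    table = sketch_strict_concat(tmp)
```

Because the paths are sorted, rows from feb.csv come before rows from jan.csv in the combined table.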