csvsmith.tools¶

The tools package contains the core logic for csvsmith’s data processing and file management capabilities. Each tool is designed to be usable both as a standalone CLI command and as a reusable Python class or function.

Submodules¶

class csvsmith.tools.classify.CSVClassifier(source_dir: str | Path, dest_dir: str | Path, signatures: dict[str, list[str]] | None = None, *, mode: str = 'strict', match: str = 'exact', auto: bool = False, dry_run: bool = False, report_only: bool = False, encoding: str = 'utf-8-sig', strip: bool = True, casefold: bool = False, drop_empty: bool = True)[source]¶

Bases: object

Classifies CSV files into folders based on header signatures.

Two orthogonal controls:

mode: “strict” | “relaxed”
match: “exact” | “contains” (“contains” is the legacy behavior)

signatures:

dict[str, list[str]]

category -> expected columns
interpretation depends on match:
exact: expected columns must match the file header exactly
contains: expected columns must be a subset of the file header
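The two match settings can be illustrated with a minimal sketch. This is not csvsmith's implementation: header_matches and classify_header are hypothetical helper names, and the order-sensitive comparison used for "exact" is an assumption made for the illustration.

```python
from __future__ import annotations


def header_matches(header: list[str], expected: list[str], match: str = "exact") -> bool:
    """Return True when a file header satisfies one signature (illustrative only)."""
    if match == "exact":
        # Assumption: "exact" compares the full column list, order included.
        return list(header) == list(expected)
    if match == "contains":
        # Every expected column must appear somewhere in the header.
        return set(expected).issubset(header)
    raise ValueError(f"unknown match setting: {match!r}")


def classify_header(header: list[str], signatures: dict[str, list[str]],
                    match: str = "exact") -> str | None:
    """Return the first category whose signature matches, or None."""
    for category, expected in signatures.items():
        if header_matches(header, expected, match):
            return category
    return None


signatures = {"orders": ["id", "total"]}
print(classify_header(["id", "total"], signatures))
print(classify_header(["id", "total", "note"], signatures, match="contains"))
```

With "exact", a file carrying an extra column falls through to no category; with "contains", the same file is still classified because the signature is a subset of its header.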

rollback(manifest_path: str | Path) → None[source]¶

run() → None[source]¶

class csvsmith.tools.classify.HeaderKey(mode: str, cols: tuple[str, ...])[source]¶

Bases: object

Hashable header signature.

mode=“strict” -> ordered tuple (col order matters)
mode=“relaxed” -> sorted unique tuple (col order does NOT matter)

cols: tuple[str, ...]¶

mode: str¶
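A hashable key with the two documented fields could be modeled as a frozen dataclass. The sketch below is only an assumption about the shape of such a key: HeaderKeySketch and make_key are hypothetical names, not part of csvsmith.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HeaderKeySketch:
    """Illustrative stand-in for a hashable header signature."""
    mode: str
    cols: Tuple[str, ...]


def make_key(header, mode="strict"):
    """Build a key from a header row according to the chosen mode."""
    if mode == "strict":
        # Keep the columns exactly as they appear, order included.
        return HeaderKeySketch(mode, tuple(header))
    if mode == "relaxed":
        # Drop duplicates and sort so column order no longer matters.
        return HeaderKeySketch(mode, tuple(sorted(set(header))))
    raise ValueError(f"unknown mode: {mode!r}")


# Relaxed keys for reordered headers collide on purpose.
groups = {make_key(["b", "a"], "relaxed"): "group-1"}
print(groups[make_key(["a", "b"], "relaxed")])
```

Because the dataclass is frozen and holds only a string and a tuple, instances are hashable and can serve as dictionary keys when grouping files by header.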

class csvsmith.tools.dense_csv.ConcentrateResult(row_count: int, transformed_cell_count: int, mapped_value_count: int, output_csv_path: Path, output_map_path: Path)[source]¶

Bases: object

Summary of a completed CSV concentration operation.

mapped_value_count: int¶

output_csv_path: Path¶

output_map_path: Path¶

row_count: int¶

transformed_cell_count: int¶

class csvsmith.tools.dense_csv.RehydrateResult(row_count: int, restored_cell_count: int, output_csv_path: Path)[source]¶

Bases: object

Summary of a completed CSV rehydration operation.

output_csv_path: Path¶

restored_cell_count: int¶

row_count: int¶

csvsmith.tools.dense_csv.concentrate_csv(input_path: Path | str, output_csv_path: Path | str, output_map_path: Path | str, *, columns: Sequence[str] | None = None, min_occurrences: int = 2, encoding: str = 'utf-8-sig') → ConcentrateResult[source]¶

Replace repeated values in selected CSV columns with deterministic tokens.

The output map is versioned and records the source header and transformed column indices so that rehydration can validate its input.
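The idea of replacing repeated values with deterministic tokens can be sketched in memory. Everything here is an assumption made for illustration: the function names, the "h:" token prefix, and the truncated SHA-256 digest are hypothetical, and the sketch omits the versioned map and header validation that the real function records.

```python
import hashlib
from collections import Counter


def concentrate_rows_sketch(rows, col_indices, min_occurrences=2):
    """Replace values seen at least min_occurrences times with tokens."""
    counts = Counter(row[i] for row in rows for i in col_indices)
    token_map = {}
    dense = []
    for row in rows:
        new_row = list(row)
        for i in col_indices:
            value = row[i]
            if counts[value] >= min_occurrences:
                # Hypothetical token format: the same value always yields the same token.
                digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
                token = "h:" + digest[:12]
                token_map[token] = value
                new_row[i] = token
        dense.append(new_row)
    return dense, token_map


def rehydrate_rows_sketch(rows, col_indices, token_map):
    """Restore tokenized cells in the selected columns."""
    return [
        [token_map.get(cell, cell) if i in col_indices else cell
         for i, cell in enumerate(row)]
        for row in rows
    ]


rows = [["alpha", "x"], ["alpha", "y"], ["beta", "x"]]
dense, token_map = concentrate_rows_sketch(rows, [0])
print(rehydrate_rows_sketch(dense, [0], token_map) == rows)
```

Only "alpha" is tokenized in this input, because "beta" occurs once and falls below the min_occurrences threshold.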

csvsmith.tools.dense_csv.generate_hash(text: str) → str[source]¶: Return a deterministic SHA-256 hexadecimal digest for text.

csvsmith.tools.dense_csv.rehydrate_csv(input_csv_path: Path | str, map_path: Path | str, output_csv_path: Path | str, *, encoding: str = 'utf-8-sig') → RehydrateResult[source]¶: Restore tokenized cells using a validated dense CSV map.

csvsmith.tools.excel2csv.excel_to_csv(excel_path: str | Path, csv_path: str | Path | None = None, *, sheet_name: str | None = None) → Path[source]¶: Convert one Excel worksheet into a CSV file.

csvsmith.tools.filter_rows.CSVCleaner¶: alias of DropRowsBySubstring

class csvsmith.tools.filter_rows.DropRowsBySubstring(csv_path: Path | str, column_name: str, unwanted_text: str, *, case_sensitive: bool = True, keep_header: bool = True)[source]¶

Bases: object

Filter CSV rows by removing rows whose selected column contains a target substring.

FILTERED_SUFFIX = '.filtered.csv'¶

iter_kept_rows() → Generator[list[str], None, None][source]¶

write_filtered_rows() → None[source]¶
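The row-filtering behavior described above can be sketched as a generator over any iterable of CSV lines. This is not the class's actual code: iter_kept_rows_sketch is a hypothetical name, and always yielding the header first is an assumption.

```python
import csv
import io


def iter_kept_rows_sketch(lines, column_name, unwanted_text, case_sensitive=True):
    """Yield the header, then every row whose selected column lacks the substring."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # Empty input: nothing to yield.
    idx = header.index(column_name)
    needle = unwanted_text if case_sensitive else unwanted_text.casefold()
    yield header
    for row in reader:
        cell = row[idx] if case_sensitive else row[idx].casefold()
        if needle not in cell:
            yield row


data = "name,note\nann,keep\nbob,DROP me\ncy,drop it\n"
for kept in iter_kept_rows_sketch(io.StringIO(data), "note", "drop", case_sensitive=False):
    print(kept)
```

A generator keeps memory use flat on large files, since each row is tested and either yielded or discarded before the next one is read.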

csvsmith.tools.filter_rows.main(csv_path: Path | str, column_name: str, unwanted_text: str) → None[source]¶

csvsmith.tools.find_matches_in_csv.find_matches_in_csv(file_path, target_key, stringency=1.0, **kwargs)[source]¶: Scans a CSV for a key and returns coordinates and neighbor data.

csvsmith.tools.move_files.move_by_suffix(src_dir: Path | str, dst_dir: Path | str, suffixes: Iterable[str] = {'.csv', '.pdf'}) → int[source]¶

Move files from src_dir to dst_dir when their suffix matches.

Suffix matching is case-insensitive and accepts values with or without a leading dot (for example, "csv" and ".csv" are treated the same).

Parameters:

src_dir – Source directory to scan for files.
dst_dir – Destination directory where matching files are moved.
suffixes – File suffixes to match against.

Returns:

The number of files moved.

csvsmith.tools.move_files.normalize_suffixes(suffixes: Iterable[str]) → set[str][source]¶
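The suffix rules described for move_by_suffix (case-insensitive, leading dot optional) can be sketched as follows. Both functions are hypothetical stand-ins, not the library's code; creating the destination directory and using a tuple default are assumptions.

```python
import shutil
from pathlib import Path


def normalize_suffixes_sketch(suffixes):
    """Lowercase each suffix and make sure it starts with a dot."""
    return {s.lower() if s.startswith(".") else "." + s.lower() for s in suffixes}


def move_by_suffix_sketch(src_dir, dst_dir, suffixes=(".csv", ".pdf")):
    """Move matching files from src_dir to dst_dir and return how many moved."""
    src, dst = Path(src_dir), Path(dst_dir)
    wanted = normalize_suffixes_sketch(suffixes)
    dst.mkdir(parents=True, exist_ok=True)  # Assumption: create the target if missing.
    moved = 0
    # sorted() materializes the listing before any file is moved.
    for path in sorted(src.iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            shutil.move(str(path), str(dst / path.name))
            moved += 1
    return moved


print(sorted(normalize_suffixes_sketch(["csv", ".CSV", "Pdf"])))
```

Normalizing once into a set keeps the per-file check to a single lowercase comparison.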

csvsmith.tools.row_dedup.add_row_digest(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, colname: str = 'row_digest', inplace: bool = False) → list[dict[str, object]][source]¶: Add a row digest column and return the resulting rows.

csvsmith.tools.row_dedup.dedupe_csv_file(src: Path | str, dst: Path | str, *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', encoding: str = 'utf-8-sig') → list[dict[str, object]][source]¶: Deduplicate a CSV file, write the result, and return the report.

csvsmith.tools.row_dedup.dedupe_with_report(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', digest_col: str = 'row_digest') → tuple[list[dict[str, object]], list[dict[str, object]]][source]¶: Drop duplicates and return (deduped_rows, report).

csvsmith.tools.row_dedup.find_duplicate_rows(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None) → list[dict[str, object]][source]¶: Return only rows that participate in duplicate groups.

csvsmith.tools.row_dedup.make_row_digest(row: Mapping[str, object], *, columns: Sequence[str]) → str[source]¶: Build a SHA-256 digest for a row using selected columns.
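A row digest over selected columns, and the keep-first deduplication it enables, might look like the sketch below. The function names, the unit-separator join, and the handling of missing columns are all assumptions; the library's exact serialization is not documented here.

```python
import hashlib


def row_digest_sketch(row, columns):
    """Hash the selected column values of one row with SHA-256."""
    # Assumption: join with the ASCII unit separator so that
    # ("ab", "c") and ("a", "bc") do not produce the same input.
    joined = "\x1f".join(str(row.get(col, "")) for col in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe_sketch(rows, columns):
    """Keep the first row for each distinct digest."""
    seen = set()
    kept = []
    for row in rows:
        digest = row_digest_sketch(row, columns)
        if digest not in seen:
            seen.add(digest)
            kept.append(dict(row))
    return kept


rows = [
    {"id": "1", "name": "a"},
    {"id": "1", "name": "a"},
    {"id": "2", "name": "a"},
]
print(len(dedupe_sketch(rows, ["id", "name"])))
```

Restricting the digest to a subset of columns changes what counts as a duplicate: hashing only "name" would collapse all three rows above into one.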

csvsmith.tools.strict_concat.find_csvs(csv_dir: Path | str) → list[Path][source]¶

Find all CSV files in the specified directory.

This function searches for all files with a .csv extension in the given directory and returns a sorted list of their paths.

Parameters:: csv_dir (Path | str) – The directory to search for CSV files. This can be provided as either a Path object or a string representing the path to the directory.
Returns:: Sorted list of paths to all .csv files found in the specified directory.
Return type:: list[Path]

csvsmith.tools.strict_concat.read_header(csv_path: Path) → list[str][source]¶

Reads the header row of a given CSV file.

This function opens the CSV file at the specified path in read mode with newline translation disabled, reads its first row, and returns it as a list of strings. The file is assumed to be encoded in UTF-8 with an optional BOM (byte order mark). If the CSV file is empty, a ValueError is raised.

Parameters:: csv_path (Path) – The path to the CSV file to read the header from.
Returns:: A list of strings representing the header row of the CSV file.
Return type:: list[str]
Raises:: ValueError – If the CSV file is empty and no header can be retrieved.
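The behavior described above maps onto the standard csv module fairly directly. The following is a minimal sketch rather than the library's code; read_header_sketch is a hypothetical name.

```python
import csv
from pathlib import Path


def read_header_sketch(csv_path):
    """Return the first row of a CSV file, or raise ValueError if it is empty."""
    # "utf-8-sig" strips a leading BOM when present; newline="" leaves
    # line-ending handling to the csv module.
    with Path(csv_path).open("r", encoding="utf-8-sig", newline="") as handle:
        header = next(csv.reader(handle), None)
    if header is None:
        raise ValueError(f"{csv_path} is empty; no header row found")
    return header
```

Decoding with "utf-8-sig" matters for files exported from spreadsheet tools, where an undecoded BOM would otherwise end up glued to the first column name.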

csvsmith.tools.strict_concat.save_csv(rows: Iterable[list[str]], out_path: Path | str) → None[source]¶: Write rows to out_path.

csvsmith.tools.strict_concat.strict_concat_rows(csv_dir: Path | str) → list[list[str]][source]¶

Concatenates rows from multiple CSV files into a list of lists of strings, ensuring the headers across all CSV files match. The output includes a new column indicating the file stem.

Parameters:: csv_dir (Path | str) – Directory containing the CSV files or a specific path to a CSV file.
Returns:: A list of lists, where each inner list represents a row from the concatenated CSV files. The first row contains the headers, including a “file_stem” column.
Return type:: list[list[str]]
Raises:: FileNotFoundError – If no CSV files are found in the provided directory.
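Strict concatenation with a header check can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: strict_concat_sketch is a hypothetical name, placing "file_stem" as the last column is a guess, and raising ValueError on a header mismatch or an empty file is an assumption.

```python
import csv
from pathlib import Path


def strict_concat_sketch(csv_dir):
    """Concatenate all .csv files in a directory, requiring identical headers."""
    paths = sorted(Path(csv_dir).glob("*.csv"))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {csv_dir}")
    out = []
    expected = None
    for path in paths:
        with path.open("r", encoding="utf-8-sig", newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header is None:
                raise ValueError(f"{path} is empty")
            if expected is None:
                expected = header
                # Assumption: the extra column is appended at the end.
                out.append(header + ["file_stem"])
            elif header != expected:
                raise ValueError(f"header mismatch in {path}")
            # Tag every data row with the stem of the file it came from.
            out.extend(row + [path.stem] for row in reader)
    return out
```

Failing on the first mismatched header, instead of aligning columns by name, is what makes the concatenation strict: a schema drift in one file surfaces as an error rather than as silently shifted data.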