csvsmith.tools

The tools package contains the core logic for csvsmith’s data processing and file management capabilities. Each tool is designed to be usable both as a standalone CLI command and as a reusable Python class or function.

Submodules

class csvsmith.tools.classify.CSVClassifier(source_dir: str | Path, dest_dir: str | Path, signatures: dict[str, list[str]] | None = None, *, mode: str = 'strict', match: str = 'exact', auto: bool = False, dry_run: bool = False, report_only: bool = False, encoding: str = 'utf-8-sig', strip: bool = True, casefold: bool = False, drop_empty: bool = True)

Bases: object

Classifies CSV files into folders based on header signatures.

Two orthogonal controls:
  • mode: “strict” | “relaxed”

  • match: “exact” | “contains” (“contains” is the legacy behavior)

signatures: dict[str, list[str]]
  • category -> expected columns

  • interpretation depends on match:

    exact: expected columns must match the file header exactly
    contains: expected columns must be a subset of the file header
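The difference between the two match rules can be sketched as follows. This is a minimal illustration, not csvsmith's implementation: the helper name classify_header is hypothetical, “exact” is shown as an order-sensitive comparison (as it would be under mode “strict”), and letting the first matching category win is an assumption.

```python
from __future__ import annotations


def classify_header(
    header: list[str],
    signatures: dict[str, list[str]],
    match: str = "exact",
) -> str | None:
    """Return the first category whose expected columns fit `header`.

    Hypothetical helper for illustration only; not part of csvsmith.
    """
    for category, expected in signatures.items():
        if match == "exact":
            # Assumption: order-sensitive comparison, as under mode="strict".
            if list(header) == list(expected):
                return category
        elif match == "contains":
            # Every expected column must appear somewhere in the header.
            if set(expected) <= set(header):
                return category
        else:
            raise ValueError(f"unknown match rule: {match!r}")
    return None


signatures = {"orders": ["id", "total"], "users": ["id", "email"]}
print(classify_header(["id", "total"], signatures))                      # orders
print(classify_header(["id", "email", "name"], signatures))              # None
print(classify_header(["id", "email", "name"], signatures, "contains"))  # users
```

With “contains”, a file may carry extra columns and still be classified; with “exact”, any extra column leaves it unmatched.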

rollback(manifest_path: str | Path) -> None
run() -> None
class csvsmith.tools.classify.HeaderKey(mode: str, cols: tuple[str, ...])

Bases: object

Hashable header signature.

mode=“strict” -> ordered tuple (column order matters)
mode=“relaxed” -> sorted unique tuple (column order does NOT matter)
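The two modes can be pictured with a small sketch. The function header_cols is a hypothetical stand-in for however HeaderKey builds its cols tuple; only the strict/relaxed behavior described above is taken from the documentation.

```python
from __future__ import annotations


def header_cols(header: list[str], mode: str = "strict") -> tuple[str, ...]:
    """Build a hashable column tuple from a header row (illustrative only)."""
    if mode == "strict":
        # Keep the columns exactly as they appear, order included.
        return tuple(header)
    if mode == "relaxed":
        # Drop duplicates and sort, so column order no longer matters.
        return tuple(sorted(set(header)))
    raise ValueError(f"unknown mode: {mode!r}")


a = ["name", "id"]
b = ["id", "name"]
print(header_cols(a, "strict") == header_cols(b, "strict"))    # False
print(header_cols(a, "relaxed") == header_cols(b, "relaxed"))  # True
```

Because the result is a tuple of strings, it can be used as a dictionary key to group files that share a header.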

cols: tuple[str, ...]
mode: str
csvsmith.tools.excel2csv.excel_to_csv(excel_path: str | Path, csv_path: str | Path | None = None, *, sheet_name: str | None = None) -> Path

Convert one Excel worksheet into a CSV file.

csvsmith.tools.filter_rows.CSVCleaner

alias of DropRowsBySubstring

class csvsmith.tools.filter_rows.DropRowsBySubstring(csv_path: Path | str, column_name: str, unwanted_text: str, *, case_sensitive: bool = True, keep_header: bool = True)

Bases: object

Filter CSV rows by removing rows whose selected column contains a target substring.
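The filtering idea can be sketched as a generator over already-parsed rows. This is an assumption-laden illustration rather than the class's code: iter_kept_rows_sketch is a hypothetical name, the first row is treated as the header, and the header is always yielded.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


def iter_kept_rows_sketch(
    rows: Iterable[list[str]],
    column_name: str,
    unwanted_text: str,
    case_sensitive: bool = True,
) -> Iterator[list[str]]:
    """Yield the header, then every row whose cell lacks `unwanted_text`."""
    it = iter(rows)
    header = next(it, None)
    if header is None:
        return  # empty input: nothing to yield
    idx = header.index(column_name)  # raises ValueError if the column is absent
    yield header
    needle = unwanted_text if case_sensitive else unwanted_text.lower()
    for row in it:
        cell = row[idx] if case_sensitive else row[idx].lower()
        if needle not in cell:
            yield row


rows = [["id", "note"], ["1", "ok"], ["2", "TEST run"], ["3", "test data"]]
print(list(iter_kept_rows_sketch(rows, "note", "test")))
print(list(iter_kept_rows_sketch(rows, "note", "test", case_sensitive=False)))
```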

FILTERED_SUFFIX = '.filtered.csv'
iter_kept_rows() -> Generator[list[str], None, None]
write_filtered_rows() -> None
csvsmith.tools.filter_rows.main(csv_path: Path | str, column_name: str, unwanted_text: str) -> None
csvsmith.tools.find_matches_in_csv.find_matches_in_csv(file_path, target_key, stringency=1.0, **kwargs)

Scans a CSV for a key and returns coordinates and neighbor data.

csvsmith.tools.move_files.move_by_suffix(src_dir: Path | str, dst_dir: Path | str, suffixes: Iterable[str] = {'.csv', '.pdf'}) -> int

Move files from src_dir to dst_dir when their suffix matches.

Suffix matching is case-insensitive and accepts values with or without a leading dot (for example, "csv" and ".csv" are treated the same).

Parameters:
  • src_dir – Source directory to scan for files.

  • dst_dir – Destination directory where matching files are moved.

  • suffixes – File suffixes to match against.

Returns:

The number of files moved.
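The suffix rules described above (case-insensitive, leading dot optional) can be sketched like this. Both helper names are hypothetical and the code is not taken from csvsmith; it only mirrors the documented matching behavior.

```python
from __future__ import annotations

from collections.abc import Iterable
from pathlib import Path


def normalize_suffixes_sketch(suffixes: Iterable[str]) -> set[str]:
    """Lower-case each suffix and make sure it starts with a dot."""
    normalized = set()
    for suffix in suffixes:
        suffix = suffix.strip().lower()
        if not suffix:
            continue  # assumption: blank entries are ignored
        normalized.add(suffix if suffix.startswith(".") else "." + suffix)
    return normalized


def suffix_matches(path: Path, suffixes: Iterable[str]) -> bool:
    """Return True if the file's suffix is in the normalized set."""
    return path.suffix.lower() in normalize_suffixes_sketch(suffixes)


print(normalize_suffixes_sketch(["CSV", ".Pdf"]) == {".csv", ".pdf"})  # True
print(suffix_matches(Path("report.CSV"), ["csv"]))                     # True
print(suffix_matches(Path("notes.txt"), ["csv", "pdf"]))               # False
```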

csvsmith.tools.move_files.normalize_suffixes(suffixes: Iterable[str]) -> set[str]
csvsmith.tools.row_dedup.add_row_digest(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, colname: str = 'row_digest', inplace: bool = False) -> list[dict[str, object]]

Add a row digest column and return the resulting rows.

csvsmith.tools.row_dedup.dedupe_csv_file(src: Path | str, dst: Path | str, *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', encoding: str = 'utf-8-sig') -> list[dict[str, object]]

Deduplicate a CSV file, write the result, and return the report.

csvsmith.tools.row_dedup.dedupe_with_report(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None, exclude: Sequence[Hashable] | None = None, keep: str = 'first', digest_col: str = 'row_digest') -> tuple[list[dict[str, object]], list[dict[str, object]]]

Drop duplicates and return (deduped_rows, report).

csvsmith.tools.row_dedup.find_duplicate_rows(rows: Sequence[Mapping[str, object]], *, subset: Sequence[Hashable] | None = None) -> list[dict[str, object]]

Return only rows that participate in duplicate groups.
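One way to picture “rows that participate in duplicate groups” is to count each row's comparison key and keep the rows whose key occurs more than once. The sketch below is hypothetical (find_duplicates_sketch is not a csvsmith function) and assumes that subset=None means all columns are compared and that cell values are hashable.

```python
from __future__ import annotations

from collections import Counter
from collections.abc import Hashable, Mapping, Sequence


def find_duplicates_sketch(
    rows: Sequence[Mapping[str, object]],
    subset: Sequence[Hashable] | None = None,
) -> list[dict[str, object]]:
    """Return every row whose key (over `subset`) appears more than once."""

    def key(row: Mapping[str, object]) -> tuple:
        # Assumption: with no subset, all columns take part in the comparison.
        cols = subset if subset is not None else sorted(row, key=str)
        return tuple((col, row.get(col)) for col in cols)

    counts = Counter(key(row) for row in rows)
    return [dict(row) for row in rows if counts[key(row)] > 1]


rows = [
    {"a": "1", "b": "x"},
    {"a": "1", "b": "y"},
    {"a": "2", "b": "z"},
]
print(find_duplicates_sketch(rows, subset=["a"]))  # the two rows with a == "1"
print(find_duplicates_sketch(rows))                # []
```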

csvsmith.tools.row_dedup.make_row_digest(row: Mapping[str, object], *, columns: Sequence[str]) -> str

Build a SHA-256 digest for a row using selected columns.
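A row digest of this kind can be sketched with the standard hashlib module. The serialization here is an assumption, since the documentation does not state how csvsmith encodes a row before hashing: values are stringified and joined with the ASCII unit separator. The name row_digest_sketch is hypothetical.

```python
from __future__ import annotations

import hashlib
from collections.abc import Mapping, Sequence


def row_digest_sketch(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Hash the selected columns of `row` with SHA-256 (illustrative only)."""
    # Assumption: missing columns count as empty strings, and "\x1f" keeps
    # ("ab", "c") from colliding with ("a", "bc").
    payload = "\x1f".join(str(row.get(col, "")) for col in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = {"id": "1", "name": "Ada", "seen": "2024-01-01"}
second = {"id": "1", "name": "Ada", "seen": "2024-06-30"}
same = row_digest_sketch(first, ["id", "name"]) == row_digest_sketch(second, ["id", "name"])
print(same)  # True: the rows agree on the selected columns
```

Rows that agree on the selected columns produce the same digest, which is what makes the digest usable as a deduplication key.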

csvsmith.tools.strict_concat.find_csvs(csv_dir: Path | str) -> list[Path]

Find all CSV files in the specified directory.

This function searches for all files with a .csv extension in the given directory and returns a sorted list of their paths.

Parameters:

csv_dir (Path | str) – The directory to search for CSV files. This can be provided as either a Path object or a string representing the path to the directory.

Returns:

Sorted list of paths to all .csv files found in the specified directory.

Return type:

list[Path]

csvsmith.tools.strict_concat.read_header(csv_path: Path) -> list[str]

Reads the header row of a given CSV file.

This function opens the CSV file at the specified path, reads its first row, and returns it as a list of strings. The file is assumed to be encoded in UTF-8 with an optional BOM (byte order mark) and is opened in read mode with newline translation disabled. If the file is empty, a ValueError is raised.

Parameters:

csv_path (Path) – The path to the CSV file to read the header from.

Returns:

A list of strings representing the header row of the CSV file.

Return type:

list[str]

Raises:

ValueError – If the CSV file is empty and no header can be retrieved.
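The behavior described above maps closely onto the standard csv module. The following is a minimal sketch under that assumption, using the hypothetical name read_header_sketch; it is not the library's source.

```python
from __future__ import annotations

import csv
from pathlib import Path


def read_header_sketch(csv_path: Path | str) -> list[str]:
    """Return the first row of a CSV file, or raise ValueError if it is empty."""
    # "utf-8-sig" strips a leading BOM if present; newline="" leaves line
    # endings to the csv module, as the csv documentation recommends.
    with Path(csv_path).open("r", encoding="utf-8-sig", newline="") as handle:
        try:
            return next(csv.reader(handle))
        except StopIteration:
            raise ValueError(f"{csv_path} is empty; no header found") from None
```

Reading with "utf-8-sig" matters for files exported from spreadsheet tools, where a BOM would otherwise end up glued to the first column name.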

csvsmith.tools.strict_concat.save_csv(rows: Iterable[list[str]], out_path: Path | str) -> None

Write rows to out_path.

csvsmith.tools.strict_concat.strict_concat_rows(csv_dir: Path | str) -> list[list[str]]

Concatenates rows from multiple CSV files into a list of lists of strings, ensuring the headers across all CSV files match. The output includes a new column indicating the file stem.

Parameters:

csv_dir (Path | str) – Directory containing the CSV files or a specific path to a CSV file.

Returns:

A list of lists, where each inner list represents a row from the concatenated CSV files. The first row contains the headers, including a “file_stem” column.

Return type:

list[list[str]]

Raises:

FileNotFoundError – If no CSV files are found in the provided directory.
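The overall flow can be sketched as follows. Several details are assumptions not stated above: the “file_stem” column is appended as the last column, a header mismatch raises ValueError, and only the directory case is handled. The name strict_concat_sketch is hypothetical.

```python
from __future__ import annotations

import csv
from pathlib import Path


def strict_concat_sketch(csv_dir: Path | str) -> list[list[str]]:
    """Concatenate all CSVs in a directory, requiring identical headers."""
    paths = sorted(Path(csv_dir).glob("*.csv"))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {csv_dir}")

    expected: list[str] | None = None
    combined: list[list[str]] = []
    for path in paths:
        with path.open("r", encoding="utf-8-sig", newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header is None:
                raise ValueError(f"{path} is empty")
            if expected is None:
                expected = header
                # Assumption: the extra column goes at the end of each row.
                combined.append(header + ["file_stem"])
            elif header != expected:
                # Assumption: a mismatch is reported as ValueError.
                raise ValueError(f"header mismatch in {path}")
            combined.extend(row + [path.stem] for row in reader)
    return combined
```

For a directory holding a.csv and b.csv with the same header x,y, the result starts with ["x", "y", "file_stem"], followed by the rows of a.csv tagged "a" and then those of b.csv tagged "b".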