csvsmith.utils

The utils package provides foundational utilities used by the tools package. This includes input/output helpers, string normalization, numeric cleaning, and distance-based similarity calculations.

While primarily intended for internal use by csvsmith tools, many of these functions (like clean_numeric and analyze_pair) are useful in general data-processing scripts.

Submodules

csvsmith.utils.clean_numeric.clean_currency_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) → float | Any: Cleans a currency-prefixed numeric string and converts it to a float.

csvsmith.utils.clean_numeric.clean_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) → float | Any: Converts the input to a float by normalizing its numeric representation.
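The signature suggests that sep is the thousands separator and decimal is the decimal mark. The following is a minimal sketch of that kind of normalization, not csvsmith's implementation: clean_numeric_sketch is a hypothetical stand-in, the pass-through on failure is an assumption drawn from the float | Any return type, and the relaxed flag is left out because its behavior is not described here.

```python
from typing import Any


def clean_numeric_sketch(value: Any, *, sep: str = ",", decimal: str = ".") -> Any:
    """Hypothetical stand-in for clean_numeric; illustrative only."""
    # Booleans are ints in Python; leave them alone rather than turn True into 1.0.
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if not isinstance(value, str):
        return value

    # Drop the thousands separator, then map the decimal mark to ".".
    text = value.strip().replace(sep, "")
    if decimal != ".":
        text = text.replace(decimal, ".")
    try:
        return float(text)
    except ValueError:
        # Assumed fallback: return the original value unchanged.
        return value


print(clean_numeric_sketch("1,234.50"))
print(clean_numeric_sketch("1.234,50", sep=".", decimal=","))
print(clean_numeric_sketch("n/a"))
```

Under these assumptions both numeric strings become 1234.5, and "n/a" comes back untouched so the caller can decide how to handle it.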

csvsmith.utils.clean_numeric.strip_currency_prefix(value: Any) → Any: Remove a single common currency symbol from the start of a value.
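As a rough illustration of removing a single leading symbol, the sketch below uses a hypothetical helper, strip_currency_prefix_sketch. The symbol set is an assumption, since this page does not list the symbols csvsmith recognizes, and so is the pass-through for non-string values.

```python
from typing import Any

# Assumed symbol set; the real list used by csvsmith is not documented here.
CURRENCY_SYMBOLS = ("$", "€", "£", "¥")


def strip_currency_prefix_sketch(value: Any) -> Any:
    """Hypothetical stand-in for strip_currency_prefix; illustrative only."""
    if not isinstance(value, str):
        return value
    text = value.lstrip()
    # str.startswith accepts a tuple, so this checks every symbol at once.
    if text.startswith(CURRENCY_SYMBOLS):
        # Remove exactly one symbol, plus any spaces that followed it.
        return text[1:].lstrip()
    return value


print(strip_currency_prefix_sketch("$1,200"))
print(strip_currency_prefix_sketch("$$5"))
```

Because only one symbol is removed, "$$5" becomes "$5" rather than "5".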

class csvsmith.utils.distance.Relation(*values)

Bases: Enum

EXACT_MATCH = 1

CASE_INSENSITIVE_MATCH = 2

WHITESPACE_TRIMMED_MATCH = 3

NORMALIZED_SPACE_MATCH = 4

NO_STRUCTURAL_MATCH = 5
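The member values read like a ladder from the strictest match to no match. The sketch below shows one plausible way such a ladder could be evaluated. The order of the checks and the exact comparison used at each rung are assumptions, classify_sketch is a hypothetical name, and the enum is redeclared locally so the example runs without csvsmith installed.

```python
from enum import Enum


class Relation(Enum):
    # Local copy of the documented members, so the sketch is self-contained.
    EXACT_MATCH = 1
    CASE_INSENSITIVE_MATCH = 2
    WHITESPACE_TRIMMED_MATCH = 3
    NORMALIZED_SPACE_MATCH = 4
    NO_STRUCTURAL_MATCH = 5


def classify_sketch(a: str, b: str) -> Relation:
    """Hypothetical ladder: return the strictest relation that holds."""
    if a == b:
        return Relation.EXACT_MATCH
    if a.casefold() == b.casefold():
        return Relation.CASE_INSENSITIVE_MATCH
    # Assumption: "trimmed" means leading/trailing whitespace is ignored.
    if a.strip() == b.strip():
        return Relation.WHITESPACE_TRIMMED_MATCH
    # Assumption: "normalized space" means all whitespace is removed.
    if "".join(a.split()) == "".join(b.split()):
        return Relation.NORMALIZED_SPACE_MATCH
    return Relation.NO_STRUCTURAL_MATCH


for left, right in [("abc", "abc"), ("ABC", "abc"), (" abc", "abc"), ("a b c", "abc"), ("abc", "abd")]:
    print(repr(left), repr(right), classify_sketch(left, right).name)
```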

class csvsmith.utils.distance.Result(classification: Relation, damerau_levenshtein_distance: int, jaro_winkler_score: float, similarity_percentage: float)

Bases: object

classification: Relation

damerau_levenshtein_distance: int

get_relation_string() → str

jaro_winkler_score: float

similarity_percentage: float

class csvsmith.utils.distance.StringDistance

Bases: object

Calculates distances and structural relationships between two strings.

The class provides methods for detecting exact matches, case-insensitive matches, and matches that differ only in whitespace. It also implements the Damerau-Levenshtein distance and the Jaro-Winkler similarity score.

The fields of the returned Result are:

classification: The relationship classification assigned to the two strings.
damerau_levenshtein_distance: Integer distance calculated using the Damerau-Levenshtein algorithm.
jaro_winkler_score: Float similarity score calculated using the Jaro-Winkler algorithm.
similarity_percentage: Similarity between the two strings, expressed as a percentage.

static analyze(a: str, b: str, ignore_case: bool = False) → Result

static calculate_damerau_levenshtein(s1: str, s2: str) → int: Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
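The restricted variant is also known as optimal string alignment (OSA) distance: each substring may be edited at most once, so a transposed pair cannot be edited again afterwards. The sketch below is the textbook dynamic-programming formulation under a hypothetical name, osa_distance; it is not copied from csvsmith and may differ from the library's code.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Textbook restricted Damerau-Levenshtein (OSA) distance; illustrative only."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] is the distance between s1[:i] and s2[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of s2[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: "ab" <-> "ba" costs one edit.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))
print(osa_distance("ca", "abc"))
```

The pair "ca" and "abc" is the usual way to tell the two variants apart: the restricted distance is 3, whereas unrestricted Damerau-Levenshtein gives 2.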

static calculate_jaro_winkler(s1: str, s2: str) → float

static classify(a: str, b: str) → Relation

static equals_ignore_case(a: str, b: str) → bool

static strip_all(s: str) → str

Remove all whitespace characters from a string.

Uses split/join logic so any whitespace character acts as a separator, including spaces, tabs, and newlines.
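The split/join idiom described here can be sketched in one line. With no arguments, str.split() splits on any run of whitespace and drops empty pieces, so joining the pieces with an empty string removes all whitespace. The name strip_all_sketch is hypothetical.

```python
def strip_all_sketch(s: str) -> str:
    """Remove every whitespace character using the split/join idiom."""
    # split() with no separator treats spaces, tabs, and newlines alike.
    return "".join(s.split())


print(strip_all_sketch(" a\tb\nc  d "))
```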

static trim(s: str) → str

csvsmith.utils.distance.analyze_pair(a: str, b: str, ignore_case: bool = False) → Result

csvsmith.utils.io.count_duplicates_sorted(items: Iterable[Hashable], threshold: int = 2, reverse: bool = True) → list[tuple[Hashable, int]]: Count items and return those occurring at least threshold times.
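A helper with this shape can be approximated with collections.Counter. In the sketch below, count_duplicates_sketch is a hypothetical name, and sorting by count (descending when reverse is true) is an assumption, because the sort key is not stated on this page.

```python
from collections import Counter
from typing import Hashable, Iterable


def count_duplicates_sketch(
    items: Iterable[Hashable], threshold: int = 2, reverse: bool = True
) -> list[tuple[Hashable, int]]:
    """Hypothetical stand-in for count_duplicates_sorted; illustrative only."""
    counts = Counter(items)
    kept = [(item, n) for item, n in counts.items() if n >= threshold]
    # Assumed ordering: by count, highest first when reverse is true.
    kept.sort(key=lambda pair: pair[1], reverse=reverse)
    return kept


print(count_duplicates_sketch(["a", "b", "a", "c", "b", "a"]))
```

Here "c" occurs once and is dropped, leaving [('a', 3), ('b', 2)].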

csvsmith.utils.io.iter_worksheet_rows(worksheet: Worksheet) → Iterable[list[str]]: Yield each worksheet row as a list of CSV-ready strings.

csvsmith.utils.io.read_csv_rows(csv_path: Path | str, encoding: str = 'utf-8-sig') → list[dict[str, Any]]: Read a CSV file into a list of row dictionaries.

csvsmith.utils.io.write_csv_rows(csv_path: Path | str, rows: Sequence[Mapping[str, object]], *, fieldnames: Sequence[str], encoding: str = 'utf-8-sig') → None: Write row dictionaries to a CSV file.
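Both helpers default to the utf-8-sig encoding, which writes a byte order mark (BOM) that Excel uses to detect UTF-8 and strips it again on read. The round trip below uses only the standard csv module to show the effect; write_rows_sketch and read_rows_sketch are hypothetical stand-ins, and the use of DictWriter and DictReader is an assumption about how such helpers are typically built.

```python
import csv
import tempfile
from pathlib import Path


def write_rows_sketch(csv_path, rows, *, fieldnames, encoding="utf-8-sig"):
    """Hypothetical stand-in for write_csv_rows; illustrative only."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding=encoding) as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_rows_sketch(csv_path, encoding="utf-8-sig"):
    """Hypothetical stand-in for read_csv_rows; illustrative only."""
    with open(csv_path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "people.csv"
    write_rows_sketch(path, [{"name": "Ada", "age": 36}], fieldnames=["name", "age"])
    has_bom = path.read_bytes().startswith(b"\xef\xbb\xbf")
    round_tripped = read_rows_sketch(path)

print(has_bom, round_tripped)
```

In this sketch every value comes back as a string ("36", not 36), because the csv module does no type conversion.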

csvsmith.utils.io.write_worksheet_to_csv(worksheet: Worksheet, csv_path: str | Path) → None: Write worksheet rows to a CSV file.

csvsmith.utils.normalize.normalize(text, ignore_case=True, ignore_whitespace=True, nfkc=True): Standardizes strings to bypass Excel formatting artifacts.
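The three flags map naturally onto standard-library operations. The sketch below is one possible reading, not csvsmith's implementation: normalize_sketch is a hypothetical name, and treating ignore_whitespace as "collapse runs and trim the ends" is an assumption (the library might remove whitespace entirely instead). NFKC normalization folds compatibility characters such as full-width letters and non-breaking spaces into their plain equivalents.

```python
import unicodedata


def normalize_sketch(text, ignore_case=True, ignore_whitespace=True, nfkc=True):
    """Hypothetical stand-in for normalize; illustrative only."""
    text = str(text)
    if nfkc:
        # Fold compatibility forms, e.g. full-width "Ｆ" -> "F", NBSP -> space.
        text = unicodedata.normalize("NFKC", text)
    if ignore_whitespace:
        # Assumed meaning: collapse whitespace runs and trim both ends.
        text = " ".join(text.split())
    if ignore_case:
        text = text.casefold()
    return text


print(normalize_sketch("  Ｆｏｏ\u00a0 Bar "))
```

With all flags enabled, the full-width, padded input above normalizes to "foo bar" and so compares equal to a plain "Foo Bar" normalized the same way.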