csvsmith.utils
The utils package provides foundational utilities used by the tools package. This includes input/output helpers, string normalization, numeric cleaning, and distance-based similarity calculations.
While primarily intended for internal use by csvsmith tools, many of these functions (like clean_numeric and analyze_pair) are useful in general data-processing scripts.
Submodules
- csvsmith.utils.clean_numeric.clean_currency_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) -> float | Any
Cleans and converts a currency-prefixed numeric string to a float.
- csvsmith.utils.clean_numeric.clean_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) -> float | Any
Cleans and converts a given input to a float by normalizing its numeric representation.
- csvsmith.utils.clean_numeric.strip_currency_prefix(value: Any) -> Any
Remove a single common currency symbol from the start of a value.
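The three cleaning helpers above can be pictured with a small standalone sketch. This is not csvsmith's implementation: the function names, the currency symbol set, and the choice to return unparseable input unchanged (suggested by the `float | Any` return annotation) are assumptions made for illustration, and the `relaxed` flag is not modelled because its behavior is not described here.

```python
from typing import Any

# Hypothetical symbol set chosen for illustration; csvsmith's actual list may differ.
CURRENCY_SYMBOLS = ("$", "€", "£", "¥")


def strip_prefix_sketch(value: Any) -> Any:
    """Drop one leading currency symbol from a string; pass other values through."""
    if isinstance(value, str):
        text = value.strip()
        if text.startswith(CURRENCY_SYMBOLS):
            return text[1:].lstrip()
    return value


def clean_numeric_sketch(value: Any, *, sep: str = ",", decimal: str = ".") -> Any:
    """Convert a grouped numeric string to float, or return the input unchanged."""
    if isinstance(value, (int, float)):
        return float(value)
    if not isinstance(value, str):
        return value
    # Remove the grouping separator first, then map the decimal mark to ".".
    normalized = value.strip().replace(sep, "").replace(decimal, ".")
    try:
        return float(normalized)
    except ValueError:
        return value  # assumption: unparseable input is handed back as-is
```

With these definitions, `clean_numeric_sketch(strip_prefix_sketch("$1,234.50"))` yields `1234.5`, and European-style input can be handled by passing `sep="."` and `decimal=","`.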
- class csvsmith.utils.distance.Relation(*values)
Bases: Enum
- CASE_INSENSITIVE_MATCH = 2
- EXACT_MATCH = 1
- NORMALIZED_SPACE_MATCH = 4
- NO_STRUCTURAL_MATCH = 5
- WHITESPACE_TRIMMED_MATCH = 3
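The member values suggest an ordering from strictest to loosest match. A minimal sketch of how a pair of strings might be classified against these members follows. The `classify_sketch` function, the order of the checks, and the assumption that the whitespace checks also ignore case are illustrative guesses, not csvsmith's documented logic.

```python
from enum import Enum


class Relation(Enum):
    # Member names and values copied from the listing above.
    EXACT_MATCH = 1
    CASE_INSENSITIVE_MATCH = 2
    WHITESPACE_TRIMMED_MATCH = 3
    NORMALIZED_SPACE_MATCH = 4
    NO_STRUCTURAL_MATCH = 5


def classify_sketch(s1: str, s2: str) -> Relation:
    """Return the first relation that holds, checked from strictest to loosest."""
    if s1 == s2:
        return Relation.EXACT_MATCH
    if s1.casefold() == s2.casefold():
        return Relation.CASE_INSENSITIVE_MATCH
    # Assumption: trimming is combined with case folding.
    if s1.strip().casefold() == s2.strip().casefold():
        return Relation.WHITESPACE_TRIMMED_MATCH
    # Collapse every run of whitespace to a single space before comparing.
    if " ".join(s1.split()).casefold() == " ".join(s2.split()).casefold():
        return Relation.NORMALIZED_SPACE_MATCH
    return Relation.NO_STRUCTURAL_MATCH
```

Checking the strictest condition first means each pair is reported under the most specific relation that applies.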
- class csvsmith.utils.distance.Result(classification: Relation, damerau_levenshtein_distance: int, jaro_winkler_score: float, similarity_percentage: float)
Bases: object
- damerau_levenshtein_distance: int
- jaro_winkler_score: float
- similarity_percentage: float
- class csvsmith.utils.distance.StringDistance
Bases: object
Provides functionality for calculating string distances and relationships between two strings based on various algorithms.
Its methods classify how two strings relate (exact match, case-insensitive match, or match after whitespace normalization) and compute the Damerau-Levenshtein distance and the Jaro-Winkler similarity score.
- Return classification:
Possible relationship classification between two strings.
- Return damerau_levenshtein_distance:
Integer distance calculated using the Damerau-Levenshtein algorithm.
- Return jaro_winkler_score:
A float score indicating similarity using the Jaro-Winkler algorithm.
- Return similarity_percentage:
A percentage similarity score between two strings.
- static calculate_damerau_levenshtein(s1: str, s2: str) -> int
Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
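The restricted variant (also called optimal string alignment) is the standard Levenshtein dynamic program with one extra case for swapping two adjacent characters. The sketch below is a textbook version under the hypothetical name `osa_distance`, offered only to illustrate the technique; the library's own code may differ in detail.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] holds the distance between s1[:i] and s2[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: the last two characters are swapped.
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

"Restricted" means no substring is edited more than once, so `osa_distance("ca", "abc")` is 3, whereas the unrestricted Damerau-Levenshtein distance for the same pair is 2.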
- csvsmith.utils.io.count_duplicates_sorted(items: Iterable[Hashable], threshold: int = 2, reverse: bool = True) -> list[tuple[Hashable, int]]
Count items and return those occurring at least threshold times.
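A function with this signature can be approximated with `collections.Counter`. In the sketch below, the name `count_duplicates_sketch` is hypothetical, and reading `reverse=True` as "highest counts first" is an assumption, since the ordering rule is not spelled out here.

```python
from collections import Counter
from collections.abc import Hashable, Iterable


def count_duplicates_sketch(
    items: Iterable[Hashable], threshold: int = 2, reverse: bool = True
) -> list[tuple[Hashable, int]]:
    """Count items and keep those seen at least `threshold` times."""
    counts = Counter(items)
    kept = [(item, n) for item, n in counts.items() if n >= threshold]
    # Assumption: sort by count, descending when reverse=True.
    # sorted() is stable, so equal counts keep first-seen order.
    return sorted(kept, key=lambda pair: pair[1], reverse=reverse)
```

For example, `count_duplicates_sketch(["a", "b", "a", "c", "b", "a"])` returns `[("a", 3), ("b", 2)]`, dropping `"c"` because it occurs only once.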
- csvsmith.utils.io.iter_worksheet_rows(worksheet: Worksheet) -> Iterable[list[str]]
Yield each worksheet row as a list of CSV-ready strings.
- csvsmith.utils.io.read_csv_rows(csv_path: Path | str, encoding: str = 'utf-8-sig') -> list[dict[str, Any]]
Read a CSV file into a list of row dictionaries.
- csvsmith.utils.io.write_csv_rows(csv_path: Path | str, rows: Sequence[Mapping[str, object]], *, fieldnames: Sequence[str], encoding: str = 'utf-8-sig') -> None
Write row dictionaries to a CSV file.