csvsmith.utils

The utils package provides foundational utilities used by the tools package. This includes input/output helpers, string normalization, numeric cleaning, and distance-based similarity calculations.

While primarily intended for internal use by csvsmith tools, many of these functions (like clean_numeric and analyze_pair) are useful in general data-processing scripts.

Submodules

csvsmith.utils.clean_numeric.clean_currency_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) -> float | Any

Cleans and converts a currency-prefixed numeric string to a float.

csvsmith.utils.clean_numeric.clean_numeric(value: Any, *, sep: str = ',', decimal: str = '.', relaxed: bool = False) -> float | Any

Cleans and converts a given input to a float by normalizing its numeric representation.
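The `sep` and `decimal` parameters suggest the usual thousands-separator and decimal-mark handling. The sketch below illustrates that idea only. It is not csvsmith's implementation: `clean_numeric_sketch` is a hypothetical name, the `relaxed` flag is left out because its semantics are not described here, and returning the input unchanged on failure is an assumption read off the `float | Any` return type.

```python
from typing import Any


def clean_numeric_sketch(value: Any, *, sep: str = ",", decimal: str = ".") -> Any:
    """Hypothetical stand-in: return a float, or the input unchanged on failure."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int; leave it untouched
    if isinstance(value, (int, float)):
        return float(value)
    if not isinstance(value, str):
        return value
    # Drop the thousands separator, then map the decimal mark to ".".
    text = value.strip().replace(sep, "")
    if decimal != ".":
        text = text.replace(decimal, ".")
    try:
        return float(text)
    except ValueError:
        return value  # assumption: non-numeric input is handed back as-is


print(clean_numeric_sketch("1,234.50"))
print(clean_numeric_sketch("1.234,50", sep=".", decimal=","))
print(clean_numeric_sketch("n/a"))
```

Swapping `sep` and `decimal` covers locales that write one thousand two hundred thirty-four and a half as `1.234,50`.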

csvsmith.utils.clean_numeric.strip_currency_prefix(value: Any) -> Any

Remove a single common currency symbol from the start of a value.

class csvsmith.utils.distance.Relation(*values)

Bases: Enum

CASE_INSENSITIVE_MATCH = 2
EXACT_MATCH = 1
NORMALIZED_SPACE_MATCH = 4
NO_STRUCTURAL_MATCH = 5
WHITESPACE_TRIMMED_MATCH = 3
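
The member values read like a cascade from the strictest match to the loosest. The sketch below shows one way such a classification could work. It is an illustration under assumptions, not the library's `classify`: the enum is redefined locally, `classify_sketch` is a hypothetical name, the checks are assumed to run in order of member value, and NORMALIZED_SPACE_MATCH is assumed to mean "equal once all whitespace is removed".

```python
from enum import Enum


class Relation(Enum):
    # Local copy of the documented members, listed in value order.
    EXACT_MATCH = 1
    CASE_INSENSITIVE_MATCH = 2
    WHITESPACE_TRIMMED_MATCH = 3
    NORMALIZED_SPACE_MATCH = 4
    NO_STRUCTURAL_MATCH = 5


def classify_sketch(a: str, b: str) -> Relation:
    """Return the first (strictest) relation that holds between a and b."""
    if a == b:
        return Relation.EXACT_MATCH
    if a.casefold() == b.casefold():
        return Relation.CASE_INSENSITIVE_MATCH
    if a.strip() == b.strip():
        return Relation.WHITESPACE_TRIMMED_MATCH
    # Assumption: compare with every whitespace character removed.
    if "".join(a.split()) == "".join(b.split()):
        return Relation.NORMALIZED_SPACE_MATCH
    return Relation.NO_STRUCTURAL_MATCH


print(classify_sketch("Total", "total").name)
print(classify_sketch("a b c", "abc").name)
```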

class csvsmith.utils.distance.Result(classification: Relation, damerau_levenshtein_distance: int, jaro_winkler_score: float, similarity_percentage: float)

Bases: object

classification: Relation
damerau_levenshtein_distance: int
get_relation_string() -> str
jaro_winkler_score: float
similarity_percentage: float

class csvsmith.utils.distance.StringDistance

Bases: object

Provides functionality for calculating string distances and relationships between two strings based on various algorithms.

This class includes methods for analyzing string similarities and relationships, including exact matches, case-insensitive comparisons, and whitespace normalization. It also implements Damerau-Levenshtein and Jaro-Winkler distance calculations.

Fields of the returned Result:

classification: the relationship classification between the two strings (a Relation member).
damerau_levenshtein_distance: integer distance calculated using the Damerau-Levenshtein algorithm.
jaro_winkler_score: float similarity score calculated using the Jaro-Winkler algorithm.
similarity_percentage: percentage similarity score between the two strings.

static analyze(a: str, b: str, ignore_case: bool = False) -> Result

static calculate_damerau_levenshtein(s1: str, s2: str) -> int

Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
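
The restricted variant is also known as optimal string alignment (OSA): it allows an adjacent pair to be swapped, but never edits the same substring twice. A minimal textbook sketch of that recurrence follows. `osa_distance` is a hypothetical name, and this is not taken from the csvsmith source.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] = distance between s1[:i] and s2[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: "ab" <-> "ba"
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))   # 1: a single adjacent swap
print(osa_distance("ca", "abc"))  # 3: the restriction forbids swap-then-insert
```

The second call is the classic case that separates the two variants: unrestricted Damerau-Levenshtein gives 2 for "ca" and "abc", while the restricted form gives 3.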

static calculate_jaro_winkler(s1: str, s2: str) -> float

static classify(a: str, b: str) -> Relation

static equals_ignore_case(a: str, b: str) -> bool

static strip_all(s: str) -> str

Remove all whitespace characters from a string.

Uses split/join logic so any whitespace character acts as a separator, including spaces, tabs, and newlines.
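
The split/join idiom described above can be sketched in one line. `strip_all_sketch` is a hypothetical stand-in for illustration, not the library function itself.

```python
def strip_all_sketch(s: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty pieces; joining the
    # pieces with "" removes the whitespace entirely.
    return "".join(s.split())


print(strip_all_sketch(" a\tb\nc  d "))  # abcd
```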

static trim(s: str) -> str

csvsmith.utils.distance.analyze_pair(a: str, b: str, ignore_case: bool = False) -> Result

csvsmith.utils.io.count_duplicates_sorted(items: Iterable[Hashable], threshold: int = 2, reverse: bool = True) -> list[tuple[Hashable, int]]

Count items and return those occurring at least threshold times.
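
A rough sketch of this behavior using `collections.Counter` follows. `count_duplicates_sketch` is a hypothetical name. Sorting by count, with `reverse=True` meaning most frequent first, is an assumption inferred from the function name and parameters rather than something confirmed by the csvsmith source.

```python
from collections import Counter
from collections.abc import Hashable, Iterable


def count_duplicates_sketch(
    items: Iterable[Hashable], threshold: int = 2, reverse: bool = True
) -> list[tuple[Hashable, int]]:
    counts = Counter(items)
    # Keep only items seen at least `threshold` times.
    kept = [(item, n) for item, n in counts.items() if n >= threshold]
    # Assumption: order by count; ties keep first-seen order (stable sort).
    return sorted(kept, key=lambda pair: pair[1], reverse=reverse)


print(count_duplicates_sketch(["a", "b", "a", "c", "b", "a"]))
# [('a', 3), ('b', 2)]
```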

csvsmith.utils.io.iter_worksheet_rows(worksheet: Worksheet) -> Iterable[list[str]]

Yield each worksheet row as a list of CSV-ready strings.

csvsmith.utils.io.read_csv_rows(csv_path: Path | str, encoding: str = 'utf-8-sig') -> list[dict[str, Any]]

Read a CSV file into a list of row dictionaries.

csvsmith.utils.io.write_csv_rows(csv_path: Path | str, rows: Sequence[Mapping[str, object]], *, fieldnames: Sequence[str], encoding: str = 'utf-8-sig') -> None

Write row dictionaries to a CSV file.
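
Both CSV helpers default to the `utf-8-sig` encoding, which writes a byte-order mark that Excel uses to detect UTF-8 and strips it again on read. The round trip below uses only the standard-library `csv` module to show what that default implies. It is a sketch of the idea, not the helpers' actual code, and the file name and sample rows are invented for the example.

```python
import csv
import tempfile
from pathlib import Path

rows = [{"name": "Ada", "amount": "1,234.50"}, {"name": "Bo", "amount": "7"}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.csv"  # hypothetical file name

    # Write: newline="" lets the csv module control line endings.
    with path.open("w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=["name", "amount"])
        writer.writeheader()
        writer.writerows(rows)

    raw = path.read_bytes()

    # Read: utf-8-sig drops the BOM, so the first header is "name",
    # not "\ufeffname".
    with path.open(newline="", encoding="utf-8-sig") as fh:
        loaded = list(csv.DictReader(fh))

print(raw[:3])  # b'\xef\xbb\xbf'
print(loaded == rows)  # True
```

Reading the same file with plain `utf-8` would leave the BOM glued to the first column name, which is a common cause of mysterious `KeyError`s on Excel-exported CSVs.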

csvsmith.utils.io.write_worksheet_to_csv(worksheet: Worksheet, csv_path: str | Path) -> None

Write worksheet rows to a CSV file.

csvsmith.utils.normalize.normalize(text, ignore_case=True, ignore_whitespace=True, nfkc=True)

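Standardizes a string so that comparisons are not thrown off by Excel formatting artifacts. The ignore_case, ignore_whitespace, and nfkc flags each toggle one normalization step.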
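
The three flags suggest a pipeline along the following lines. This is a sketch under assumptions, not the csvsmith implementation: `normalize_sketch` is a hypothetical name, the order of the steps is a guess, and `ignore_whitespace` is assumed to trim the ends and collapse internal runs to a single space (the real function may remove whitespace entirely).

```python
import unicodedata


def normalize_sketch(text, ignore_case=True, ignore_whitespace=True, nfkc=True):
    text = str(text)
    if nfkc:
        # NFKC folds compatibility characters: full-width letters become
        # ASCII, and a non-breaking space becomes a regular space.
        text = unicodedata.normalize("NFKC", text)
    if ignore_whitespace:
        # Assumption: trim and collapse whitespace runs to one space.
        text = " ".join(text.split())
    if ignore_case:
        text = text.casefold()
    return text


print(normalize_sketch("  Hello   World "))             # hello world
print(normalize_sketch("\uff21\uff22\uff23\u00a0 1"))  # abc 1
```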