StringMismatch

class StringMismatch

Detect different variants of string categories (e.g. “mislabeled” vs “mis-labeled”) in a categorical column.

This check tests all the categorical columns within a dataset and searches for variants of similar strings. Specifically, two strings are considered similar if they are equal when case and non-letter characters are ignored. Example: a column contains the strings ‘OK’ and ‘ok.’, which are variants of the same category. Knowing that both exist, we can fix the data so that it has only one category.
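The similarity rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names base_form and find_variants are hypothetical, and the normalization assumes that "non-letter characters" means every character for which str.isalpha() is false.

```python
from collections import defaultdict


def base_form(value: str) -> str:
    """Keep only the letters of the string and lowercase them (assumed normalization)."""
    return "".join(ch for ch in value if ch.isalpha()).lower()


def find_variants(values):
    """Map each base form to the sorted distinct spellings that share it."""
    groups = defaultdict(set)
    for value in values:
        groups[base_form(value)].add(value)
    # Only base forms with more than one spelling count as a mismatch.
    return {
        base: sorted(spellings)
        for base, spellings in groups.items()
        if len(spellings) > 1
    }


# Hypothetical column values, taken from the examples in the text.
column = ["OK", "ok.", "mislabeled", "mis-labeled", "other"]
variants = find_variants(column)
print(variants)
```

Here ‘OK’ and ‘ok.’ share the base form ‘ok’, and ‘mislabeled’ and ‘mis-labeled’ share the base form ‘mislabeled’, so both groups are reported, while ‘other’ has a single spelling and is not.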

Parameters
columns : Union[Hashable, List[Hashable]], default: None

Columns to check. If none are given, all columns except the ignored ones are checked.

ignore_columns : Union[Hashable, List[Hashable]], default: None

Columns to ignore. If none are given, the columns to check are determined by the columns parameter.

n_top_columns : int, default: 10

Number of columns to show, ordered by feature importance (date, index, and label columns come first).

n_samples : int, default: 1_000_000

Number of samples to use for this check.

random_state : int, default: 42

Random seed for all check internals.

__init__(columns: Optional[Union[Hashable, List[Hashable]]] = None, ignore_columns: Optional[Union[Hashable, List[Hashable]]] = None, n_top_columns: int = 10, n_samples: int = 1000000, random_state: int = 42, **kwargs)
__new__(*args, **kwargs)

Methods

StringMismatch.add_condition(name, ...)

Add new condition function to the check.

StringMismatch.add_condition_no_variants()

Add condition - no variants are allowed.

StringMismatch.add_condition_number_variants_less_or_equal(...)

Add condition - number of variants (per string base form) is less than or equal to threshold.

StringMismatch.add_condition_ratio_variants_less_or_equal([...])

Add condition - ratio of variants in the data is less than or equal to threshold.

StringMismatch.clean_conditions()

Remove all conditions from this check instance.

StringMismatch.conditions_decision(result)

Run conditions on given result.

StringMismatch.config([include_version])

Return check configuration (conditions' configuration not yet supported).

StringMismatch.from_config(conf[, ...])

Return check object from a CheckConfig object.

StringMismatch.from_json(conf[, version_unmatch])

Deserialize check instance from JSON string.

StringMismatch.metadata([with_doc_link])

Return check metadata.

StringMismatch.name()

Name of class in split camel case.

StringMismatch.params([show_defaults])

Return parameters to show when printing the check.

StringMismatch.remove_condition(index)

Remove given condition by index.

StringMismatch.run(dataset[, model, ...])

Run check.

StringMismatch.run_logic(context, dataset_kind)

Run check.

StringMismatch.to_json([indent])

Serialize check instance to JSON string.

Examples
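The following plain-Python sketch illustrates the kind of quantities the three variant conditions listed under Methods compare against a threshold. It does not call the library and is not its implementation: the helper names base_form and variant_stats and both threshold values are hypothetical. It also assumes that the ratio is the fraction of rows whose value belongs to a base form with more than one spelling, which is only one plausible reading of the condition's description.

```python
from collections import Counter, defaultdict


def base_form(value: str) -> str:
    """Keep only the letters of the string and lowercase them (assumed normalization)."""
    return "".join(ch for ch in value if ch.isalpha()).lower()


def variant_stats(values):
    """Return (largest number of variants per base form, ratio of affected rows)."""
    counts = Counter(values)
    groups = defaultdict(set)
    for value in counts:
        groups[base_form(value)].add(value)
    # Base forms that appear under more than one spelling.
    mismatched = {b: s for b, s in groups.items() if len(s) > 1}
    max_variants = max((len(s) for s in mismatched.values()), default=0)
    # Assumed definition: rows whose value belongs to a mismatched base form.
    rows_affected = sum(counts[v] for s in mismatched.values() for v in s)
    ratio = rows_affected / len(values) if values else 0.0
    return max_variants, ratio


# Hypothetical column values.
column = ["Deep", "deep", "deep!!!", "earth", "foo", "bar", "foo?", "bar"]
max_variants, ratio = variant_stats(column)

# Hypothetical thresholds, mirroring the three variant conditions.
no_variants_ok = max_variants == 0
number_variants_ok = max_variants <= 3
ratio_variants_ok = ratio <= 0.5

print(max_variants, ratio, no_variants_ok, number_variants_ok, ratio_variants_ok)
```

In this data the base form ‘deep’ has three spellings and ‘foo’ has two, so five of the eight rows are affected. A no-variants condition would fail, a limit of three variants per base form would pass, and a ratio limit of 0.5 would fail.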