StringMismatchComparison

class StringMismatchComparison

Detect different variants of string categories between the same categorical column in two datasets.

This check compares the same categorical column in a dataset and in a baseline, and reports variants of similar strings that exist only in the dataset and not in the baseline. Two strings are considered similar if they are equal when case and non-letter characters are ignored.

Example: a train dataset contains the similar strings ‘string’ and ‘St. Ring’, which have different meanings. The tested dataset contains the strings ‘string’, ‘St. Ring’ and a new phrase, ‘st. ring’. This is a new variant of the strings above, and we would like to be notified of it, as it is obviously a different version of ‘St. Ring’.
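The similarity rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper names `base_form` and `new_variants` are hypothetical, and treating "letter" as anything for which `str.isalpha()` is true is an assumption.

```python
from collections import defaultdict


def base_form(value: str) -> str:
    """Hypothetical helper: lower-case the string and keep only letters."""
    return "".join(ch for ch in value.lower() if ch.isalpha())


def new_variants(baseline, tested):
    """Hypothetical helper: map each base form to the variants that
    appear in `tested` but not in `baseline`.

    Only base forms already present in the baseline are reported, so
    completely new categories are not counted as variants.
    """
    baseline_variants = defaultdict(set)
    for value in baseline:
        baseline_variants[base_form(value)].add(value)

    found = defaultdict(set)
    for value in tested:
        key = base_form(value)
        if key in baseline_variants and value not in baseline_variants[key]:
            found[key].add(value)
    return dict(found)


# The example from the description: 'st. ring' is new in the tested data.
train_values = ["string", "St. Ring"]
test_values = ["string", "St. Ring", "st. ring"]
found = new_variants(train_values, test_values)
print(found)
```

With the values from the example, all three strings share the base form `string`, and only ‘st. ring’ is absent from the baseline, so it is the single variant reported.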

Parameters
columns : Union[Hashable, List[Hashable]], default: None

Columns to check. If none are given, checks all columns except the ignored ones.

ignore_columns : Union[Hashable, List[Hashable]], default: None

Columns to ignore. If none are given, the columns to check are determined by the columns parameter.

n_top_columns : int, default: 10

Number of columns to show, ordered by feature importance (date, index and label come first).

n_samples : int, default: 10_000

Number of samples to use for this check.

random_state : int, default: 42

Random seed for all check internals.

__init__(columns: Optional[Union[Hashable, List[Hashable]]] = None, ignore_columns: Optional[Union[Hashable, List[Hashable]]] = None, n_top_columns: int = 10, n_samples: int = 10000, random_state: int = 42, **kwargs)
__new__(*args, **kwargs)

Methods

StringMismatchComparison.add_condition(name, ...)

Add new condition function to the check.

StringMismatchComparison.add_condition_no_new_variants()

Add condition - no new variants allowed in test data.

StringMismatchComparison.add_condition_ratio_new_variants_not_greater_than(ratio)

Add condition - no new variants allowed above the given ratio in test data.

StringMismatchComparison.clean_conditions()

Remove all conditions from this check instance.

StringMismatchComparison.conditions_decision(result)

Run conditions on given result.

StringMismatchComparison.finalize_check_result(...)

Finalize the check result by adding the check instance and processing the conditions.

StringMismatchComparison.metadata([...])

Return check metadata.

StringMismatchComparison.name()

Name of class in split camel case.

StringMismatchComparison.params([show_defaults])

Return parameters to show when printing the check.

StringMismatchComparison.remove_condition(index)

Remove given condition by index.

StringMismatchComparison.run(train_dataset, ...)

Run check.

StringMismatchComparison.run_logic(context)

Run check.

Examples