TrainTestFeatureDrift

class TrainTestFeatureDrift

Calculate drift between train dataset and test dataset per feature, using statistical measures.

The check calculates a drift score for each column in the test dataset by comparing its distribution to that of the same column in the train dataset. For numerical columns, we use the Earth Mover's Distance (see https://en.wikipedia.org/wiki/Wasserstein_metric). For categorical columns, we use the Population Stability Index (PSI) (see https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf).
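To make the two measures concrete, here is a minimal pure-Python sketch of each. It is not the library's implementation: the function names `earth_movers_distance` and `psi` and the `epsilon` smoothing value are assumptions made for illustration, and the check itself may normalize or bin the data differently.

```python
import math
from collections import Counter


def earth_movers_distance(train, test):
    """1-D Wasserstein-1 distance between two empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    train, test = sorted(train), sorted(test)
    points = sorted(set(train) | set(test))
    distance = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance each pointer past all values <= left to get the CDF at `left`.
        while i < len(train) and train[i] <= left:
            i += 1
        while j < len(test) and test[j] <= left:
            j += 1
        cdf_gap = abs(i / len(train) - j / len(test))
        distance += cdf_gap * (right - left)
    return distance


def psi(train, test, epsilon=1e-4):
    """Population Stability Index between two categorical samples."""
    train_counts, test_counts = Counter(train), Counter(test)
    score = 0.0
    for category in set(train_counts) | set(test_counts):
        # epsilon (an assumed smoothing value) avoids log(0) for unseen categories.
        p_train = max(train_counts[category] / len(train), epsilon)
        p_test = max(test_counts[category] / len(test), epsilon)
        score += (p_test - p_train) * math.log(p_test / p_train)
    return score
```

Both functions return 0 for identical samples and grow as the distributions move apart. Shifting every numerical value by 1, for instance, gives an Earth Mover's Distance of 1.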

Parameters
columns: Union[Hashable, List[Hashable]], default: None

Columns to check. If none are given, all columns except the ignored ones are checked.

ignore_columns: Union[Hashable, List[Hashable]], default: None

Columns to ignore. If none are given, the columns to check are determined by the columns parameter.

n_top_columns: int, default: 5

Number of columns to show, ordered by feature importance (date, index, and label columns are shown first).

sort_feature_by: str, default: ‘feature importance’

Indicates how features will be sorted. Can be either “feature importance” or “drift score”.

margin_quantile_filter: float, default: 0.025

Float in the range [0, 0.5), representing which margins (high and low quantiles) of the distribution will be filtered out of the EMD calculation. This is done so that extreme values do not affect the calculation disproportionately. The filter is applied to both distributions, in both margins.
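The effect of the margin filter can be sketched as follows. This is an illustration only: `trim_margins` is a hypothetical helper, and it cuts a fixed share of sorted points from each tail, whereas the library may compute the quantile cut-offs differently (for example with interpolation).

```python
def trim_margins(values, margin_quantile_filter=0.025):
    """Drop the lowest and highest share of values before measuring drift."""
    if not 0 <= margin_quantile_filter < 0.5:
        raise ValueError("margin_quantile_filter must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of points to cut from each tail (rounded down).
    cut = int(len(ordered) * margin_quantile_filter)
    return ordered[cut:len(ordered) - cut]
```

With the default of 0.025, a single extreme outlier in a sample of 41 values is removed together with the smallest value, so it no longer dominates the distance.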

max_num_categories_for_drift: int, default: 10

Only for categorical columns. Max number of allowed categories. If there are more, they are binned into an “Other” category. If None, there is no limit.
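A minimal sketch of this kind of binning is shown below. The helper name `bin_rare_categories` is hypothetical, and ranking categories by their combined train and test counts is an assumption; the library may rank them differently.

```python
from collections import Counter


def bin_rare_categories(train, test, max_num_categories=10, other_label="Other"):
    """Keep the most frequent categories and map the rest to `other_label`."""
    if max_num_categories is None:
        # No limit: leave both samples unchanged.
        return list(train), list(test)
    # Assumption: frequency is ranked on the combined train + test counts.
    counts = Counter(train) + Counter(test)
    keep = {category for category, _ in counts.most_common(max_num_categories)}

    def remap(values):
        return [v if v in keep else other_label for v in values]

    return remap(train), remap(test)
```

Binning keeps the drift score from being driven by a long tail of rare categories, each of which contributes a noisy term of its own.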

max_num_categories_for_display: int, default: 10

Max number of categories to show in plot.

show_categories_by: str, default: ‘largest_difference’

Specify which categories to show for categorical features’ graphs, as the number of shown categories is limited by max_num_categories_for_display. Possible values:

- ‘train_largest’: Show the largest train categories.
- ‘test_largest’: Show the largest test categories.
- ‘largest_difference’: Show the largest difference between categories.
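The three options can be read as three ranking rules. The sketch below is one plausible interpretation, not the library's code: `categories_to_show` is a hypothetical helper, and it assumes that "largest" refers to relative frequency and that "difference" means the absolute gap between train and test frequencies.

```python
from collections import Counter


def categories_to_show(train, test, show_categories_by="largest_difference",
                       max_num_categories_for_display=10):
    """Pick which categories a drift plot would display (illustrative only)."""
    train_freq = {c: n / len(train) for c, n in Counter(train).items()}
    test_freq = {c: n / len(test) for c, n in Counter(test).items()}

    def score(category):
        p_train = train_freq.get(category, 0.0)
        p_test = test_freq.get(category, 0.0)
        if show_categories_by == "train_largest":
            return p_train
        if show_categories_by == "test_largest":
            return p_test
        if show_categories_by == "largest_difference":
            return abs(p_train - p_test)
        raise ValueError(f"unknown show_categories_by: {show_categories_by!r}")

    categories = set(train_freq) | set(test_freq)
    # Highest score first; ties are broken by name for a stable order.
    ranked = sorted(categories, key=lambda c: (-score(c), str(c)))
    return ranked[:max_num_categories_for_display]
```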

n_samples: int, default: 100_000

Number of samples to use for drift computation and plot.

random_state: int, default: 42

Random seed for sampling.
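Together, n_samples and random_state describe a capped, reproducible sample. A rough sketch of that behaviour, using a hypothetical helper `sample_rows` and assuming sampling without replacement, might look like this:

```python
import random


def sample_rows(rows, n_samples=100_000, random_state=42):
    """Return at most n_samples rows, drawn reproducibly without replacement."""
    rows = list(rows)
    if len(rows) <= n_samples:
        # Small datasets are used in full.
        return rows
    rng = random.Random(random_state)
    return rng.sample(rows, n_samples)
```

Fixing the seed means repeated runs on the same data draw the same rows, so the resulting drift scores are comparable between runs.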

max_num_categories: int, default: None

Deprecated. Please use max_num_categories_for_drift and max_num_categories_for_display instead.

__init__(columns: Optional[Union[Hashable, List[Hashable]]] = None, ignore_columns: Optional[Union[Hashable, List[Hashable]]] = None, n_top_columns: int = 5, sort_feature_by: str = 'feature importance', margin_quantile_filter: float = 0.025, max_num_categories_for_drift: int = 10, max_num_categories_for_display: int = 10, show_categories_by: str = 'largest_difference', n_samples: int = 100000, random_state: int = 42, max_num_categories: Optional[int] = None, **kwargs)
__new__(*args, **kwargs)

Methods

TrainTestFeatureDrift.add_condition(name, ...)

Add new condition function to the check.

TrainTestFeatureDrift.add_condition_drift_score_not_greater_than([...])

Add condition - require drift score to not be more than a certain threshold.

TrainTestFeatureDrift.clean_conditions()

Remove all conditions from this check instance.

TrainTestFeatureDrift.conditions_decision(result)

Run conditions on given result.

TrainTestFeatureDrift.finalize_check_result(...)

Finalize the check result by adding the check instance and processing the conditions.

TrainTestFeatureDrift.metadata([with_doc_link])

Return check metadata.

TrainTestFeatureDrift.name()

Name of class in split camel case.

TrainTestFeatureDrift.params([show_defaults])

Return parameters to show when printing the check.

TrainTestFeatureDrift.remove_condition(index)

Remove given condition by index.

TrainTestFeatureDrift.run(train_dataset, ...)

Run check.

TrainTestFeatureDrift.run_logic(context)

Calculate drift for all columns.

Examples
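A minimal usage sketch, built only from the constructor parameters and methods listed above. The import paths (`deepchecks.tabular` and `deepchecks.tabular.checks`) and the `Dataset` constructor arguments are assumptions that depend on the installed deepchecks version, and the wrapper function `run_feature_drift` is hypothetical.

```python
def run_feature_drift(train_df, test_df, label, cat_features=None):
    """Run the feature drift check on two pandas DataFrames.

    Imports are local so this module loads even without deepchecks installed.
    """
    # Assumed import paths; they may differ between deepchecks versions.
    from deepchecks.tabular import Dataset
    from deepchecks.tabular.checks import TrainTestFeatureDrift

    train_ds = Dataset(train_df, label=label, cat_features=cat_features)
    test_ds = Dataset(test_df, label=label, cat_features=cat_features)

    check = TrainTestFeatureDrift(
        n_top_columns=5,
        sort_feature_by="drift score",  # no model is passed, so sort by drift
        margin_quantile_filter=0.025,
        max_num_categories_for_drift=10,
    )
    # Use the condition's default thresholds.
    check.add_condition_drift_score_not_greater_than()
    return check.run(train_ds, test_ds)
```

The returned result can then be displayed or inspected. Sorting by drift score is chosen here because sorting by feature importance generally relies on a model being available.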