FeatureDrift#

class FeatureDrift[source]#

Calculate drift between train dataset and test dataset per feature, using statistical measures.

The check calculates a drift score for each column in the test dataset by comparing its distribution to that of the train dataset.

For numerical columns, we use the Kolmogorov-Smirnov statistic (see https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test). We also support Earth Mover’s Distance (EMD) (see https://en.wikipedia.org/wiki/Wasserstein_metric).
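As an illustration of the numerical measure, the two-sample KS statistic is the maximum absolute distance between the two empirical CDFs. The following is a minimal pure-Python sketch; the function name is illustrative and is not part of this library's API:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of the two samples (illustrative sketch)."""
    all_values = sorted(set(sample_a) | set(sample_b))
    d = 0.0
    for v in all_values:
        # Empirical CDF at v: fraction of each sample that is <= v
        cdf_a = sum(x <= v for x in sample_a) / len(sample_a)
        cdf_b = sum(x <= v for x in sample_b) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples -> 0.0
print(ks_statistic([1, 1, 2, 2], [3, 3, 4, 4]))  # disjoint samples -> 1.0
```

The statistic is already in [0, 1], with 0 meaning identical distributions and 1 meaning fully separated ones, which makes it convenient as a per-feature drift score.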

For categorical distributions, we use Cramer’s V (see https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V). We also support the Population Stability Index (PSI) (see https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf).

For categorical variables, it is recommended to use Cramer’s V, unless your variable includes categories with a small number of samples (common practice is categories with fewer than 5 samples), in which case PSI may be preferable. However, for a variable with many categories that each have few samples, Cramer’s V is still recommended.
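To make the categorical measure concrete, here is a minimal sketch of (uncorrected) Cramer's V computed over a 2 x k contingency table of train/test category counts. The function name is illustrative, and the library's actual implementation may apply a bias correction:

```python
from collections import Counter
from math import sqrt

def cramers_v(train_cats, test_cats):
    """Uncorrected Cramer's V between two categorical samples,
    via a 2 x k contingency table (illustrative sketch)."""
    counts = [Counter(train_cats), Counter(test_cats)]
    categories = sorted(set(train_cats) | set(test_cats))
    row_totals = [len(train_cats), len(test_cats)]
    n = row_totals[0] + row_totals[1]
    chi2 = 0.0
    for cat in categories:
        col_total = counts[0][cat] + counts[1][cat]
        for row in range(2):
            expected = row_totals[row] * col_total / n
            chi2 += (counts[row][cat] - expected) ** 2 / expected
    # With 2 rows, min(rows, cols) - 1 == 1, so V reduces to sqrt(chi2 / n)
    return sqrt(chi2 / n)

print(cramers_v(["a", "b"] * 50, ["a", "b"] * 50))  # identical -> 0.0
print(cramers_v(["a"] * 50, ["b"] * 50))            # disjoint  -> 1.0
```

Like the KS statistic, this yields a score in [0, 1], so numerical and categorical features can be compared on the same scale.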

Parameters
columns: Union[Hashable, List[Hashable]], default: None

Columns to check. If none are given, checks all columns except ignored ones.

ignore_columns: Union[Hashable, List[Hashable]], default: None

Columns to ignore. If none are given, checks based on the columns variable.

n_top_columns: int, default: 5

Number of columns to show, ordered by feature importance (date, index, and label are first).

sort_feature_by: str, default: “drift + importance”

Indicates how features will be sorted. Possible values:

- “feature importance”: sort features by feature importance.
- “drift score”: sort features by drift score.
- “drift + importance”: sort features by the sum of the drift score and the feature importance.

margin_quantile_filter: float, default: 0.025

Float in range [0, 0.5), representing which margins (high and low quantiles) of the distribution will be filtered out of the EMD calculation. This is done so that extreme values do not affect the calculation disproportionately. The filter is applied to both distributions, in both margins.
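A minimal sketch of this kind of margin filtering; it is an approximation, and the library's exact quantile computation may differ:

```python
def filter_margins(values, margin_quantile=0.025):
    """Trim values outside the [margin, 1 - margin] quantile range
    (illustrative sketch of margin filtering before EMD)."""
    s = sorted(values)
    k = int(round(margin_quantile * len(s)))  # samples trimmed from each tail
    return s[k: len(s) - k]

trimmed = filter_margins(list(range(1000)))
print(trimmed[0], trimmed[-1])  # 25 974 -- the extreme 2.5% on each side is gone
```

In the check itself, this filtering would be applied to both the train and test distributions before computing EMD.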

min_category_size_ratio: float, default: 0.01

Minimum size ratio for categories. Categories with a size ratio lower than this number are binned into an “Other” category.
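A sketch of how such ratio-based binning can work; the function name and the exact binning logic are illustrative:

```python
from collections import Counter

def bin_rare_categories(values, min_ratio=0.01):
    """Map categories whose frequency ratio is below min_ratio
    to an 'Other' category (illustrative sketch)."""
    counts = Counter(values)
    n = len(values)
    keep = {cat for cat, c in counts.items() if c / n >= min_ratio}
    return [v if v in keep else "Other" for v in values]

data = ["a"] * 98 + ["b", "c"]  # 'b' and 'c' are each 1% of the data
print(set(bin_rare_categories(data, min_ratio=0.02)))  # {'a', 'Other'}
```

Merging rare categories this way keeps a handful of tiny categories from dominating the categorical drift score.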

max_num_categories_for_drift: int, default: None

Only for categorical features. Max number of allowed categories. If there are more, they are binned into an “Other” category. This limit applies to both the drift calculation and the distribution plots.

max_num_categories_for_display: int, default: 10

Max number of categories to show in plot.

show_categories_by: str, default: ‘largest_difference’

Specify which categories to show in categorical features’ graphs, as the number of shown categories is limited by max_num_categories_for_display. Possible values:

- ‘train_largest’: show the largest train categories.
- ‘test_largest’: show the largest test categories.
- ‘largest_difference’: show the categories with the largest difference between train and test.
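A sketch of the ‘largest_difference’ selection, assuming per-category counts for train and test are available; names are illustrative:

```python
def categories_to_show(train_counts, test_counts, max_num=10):
    """Pick the categories with the largest train/test frequency
    difference (illustrative sketch of 'largest_difference')."""
    n_train = sum(train_counts.values())
    n_test = sum(test_counts.values())
    cats = set(train_counts) | set(test_counts)
    return sorted(
        cats,
        # Absolute difference between the category's train and test ratios
        key=lambda c: abs(train_counts.get(c, 0) / n_train
                          - test_counts.get(c, 0) / n_test),
        reverse=True,
    )[:max_num]

print(categories_to_show({"a": 50, "b": 50},
                         {"a": 90, "b": 5, "c": 5}, max_num=2))  # ['b', 'a']
```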

numerical_drift_method: str, default: “KS”

Decides which method to use for numerical variables. Possible values: “EMD” for Earth Mover’s Distance, “KS” for Kolmogorov-Smirnov.

categorical_drift_method: str, default: “cramers_v”

Decides which method to use for categorical variables. Possible values: “cramers_v” for Cramer’s V, “PSI” for Population Stability Index.

ignore_na: bool, default: True

For categorical columns only. If True, ignores None values for categorical drift. If False, considers None a separate category. For numerical columns, None values are always ignored.

aggregation_method: Optional[str], default: ‘l3_weighted’

Argument for the reduce_output functionality; decides how to aggregate the vector of per-feature scores into a single aggregated score. The aggregated score value is between 0 and 1 for all methods. Possible values:

- ‘l3_weighted’: Default. L3 norm over the per-feature scores vector, weighted by feature importance: sum(FI * PER_FEATURE_SCORES^3)^(1/3). This method takes feature importance into account yet puts more weight on the per-feature scores. Recommended for most cases.
- ‘l5_weighted’: Similar to ‘l3_weighted’, but with the L5 norm. Puts even more emphasis on the per-feature scores, specifically the largest ones, returning a score closer to the maximum among the per-feature scores.
- ‘weighted’: Weighted mean of per-feature scores based on feature importance.
- ‘max’: Maximum of all the per-feature scores.
- None: No averaging. Returns a dict with a per-feature score for each feature.
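As a sketch of the weighted-norm aggregation described above (p=3 mirrors ‘l3_weighted’, p=5 mirrors ‘l5_weighted’); the function name and inputs are illustrative:

```python
def aggregate_drift_scores(per_feature_scores, feature_importances, p=3):
    """Weighted Lp aggregation: sum(FI * score**p) ** (1/p).
    With p=3 this mirrors the 'l3_weighted' formula above (sketch)."""
    return sum(fi * s ** p
               for fi, s in zip(feature_importances, per_feature_scores)) ** (1 / p)

# A single high-drift feature dominates the aggregate: the result (~0.715)
# is much closer to the max score (0.9) than the weighted mean (0.5) would be.
print(round(aggregate_drift_scores([0.9, 0.1], [0.5, 0.5]), 3))  # 0.715
```

Raising the scores to the p-th power before the weighted sum is what pulls the aggregate toward the largest per-feature drift scores.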

min_samples: Optional[int], default: 10

Minimum number of samples required to calculate the drift score. If there are not enough samples for either train or test, the check will return None for that feature. If there are not enough samples for all features, the check will raise a NotEnoughSamplesError exception.

n_samples: int, default: 100_000

Number of samples to use for drift computation and plot.

random_state: int, default: 42

Random seed for sampling.

__init__(columns: Optional[Union[Hashable, List[Hashable]]] = None, ignore_columns: Optional[Union[Hashable, List[Hashable]]] = None, n_top_columns: int = 5, sort_feature_by: str = 'drift + importance', margin_quantile_filter: float = 0.025, max_num_categories_for_drift: Optional[int] = None, min_category_size_ratio: float = 0.01, max_num_categories_for_display: int = 10, show_categories_by: str = 'largest_difference', numerical_drift_method: str = 'KS', categorical_drift_method: str = 'cramers_v', ignore_na: bool = True, aggregation_method: Optional[str] = 'l3_weighted', min_samples: Optional[int] = 10, n_samples: int = 100000, random_state: int = 42, **kwargs)[source]#
__new__(*args, **kwargs)#

Methods

FeatureDrift.add_condition(name, ...)

Add new condition function to the check.

FeatureDrift.add_condition_drift_score_less_than([...])

Add condition - require drift score to be less than the threshold.

FeatureDrift.clean_conditions()

Remove all conditions from this check instance.

FeatureDrift.conditions_decision(result)

Run conditions on given result.

FeatureDrift.config([include_version, ...])

Return check configuration (conditions' configuration not yet supported).

FeatureDrift.feature_reduce(...)

Return an aggregated drift score based on aggregation method defined.

FeatureDrift.from_config(conf[, version_unmatch])

Return check object from a CheckConfig object.

FeatureDrift.from_json(conf[, version_unmatch])

Deserialize check instance from JSON string.

FeatureDrift.greater_is_better()

Return True if the check reduce_output is better when it is greater.

FeatureDrift.metadata([with_doc_link])

Return check metadata.

FeatureDrift.name()

Name of class in split camel case.

FeatureDrift.params([show_defaults])

Return parameters to show when printing the check.

FeatureDrift.reduce_output(check_result)

Return an aggregated drift score based on aggregation method defined.

FeatureDrift.remove_condition(index)

Remove given condition by index.

FeatureDrift.run(train_dataset, test_dataset)

Run check.

FeatureDrift.run_logic(context)

Calculate drift for all columns.

FeatureDrift.to_json([indent, ...])

Serialize check instance to JSON string.

Examples#