LabelDrift

class LabelDrift

Calculate label drift between the train dataset and the test dataset, using statistical measures.

The check calculates a drift score for the label in the test dataset by comparing its distribution to that of the train dataset.

For categorical distributions, we use Cramer’s V. See https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V. We also support the Population Stability Index (PSI). See https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf.

For categorical labels, it is recommended to use Cramer’s V, unless your variable includes categories with a small number of samples (common practice is categories with fewer than 5 samples). However, if the variable has many categories with few samples, it is still recommended to use Cramer’s V.
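To make the two categorical measures concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the library's implementation: the helper names (`category_counts`, `cramers_v`, `psi`) and the `eps` floor are hypothetical, Cramer’s V is computed without any bias correction, and both samples are assumed to be non-empty.

```python
import math
from collections import Counter


def category_counts(train, test):
    """Return aligned count lists for every category seen in either sample."""
    train_counts, test_counts = Counter(train), Counter(test)
    categories = sorted(set(train_counts) | set(test_counts))
    return ([train_counts[c] for c in categories],
            [test_counts[c] for c in categories])


def cramers_v(train, test):
    """Uncorrected Cramer's V of the 2 x k (dataset x category) table."""
    rows = category_counts(train, test)
    n = sum(sum(row) for row in rows)
    col_totals = [a + b for a, b in zip(*rows)]
    if len(col_totals) < 2:
        return 0.0  # a single category leaves nothing to compare
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    # min(rows - 1, cols - 1) equals 1 for a 2 x k table with k >= 2.
    return math.sqrt(chi2 / n)


def psi(train, test, eps=1e-4):
    """Population Stability Index; eps (an assumption) avoids log(0)."""
    train_counts, test_counts = category_counts(train, test)
    n_train, n_test = sum(train_counts), sum(test_counts)
    score = 0.0
    for a, b in zip(train_counts, test_counts):
        p_train = max(a / n_train, eps)
        p_test = max(b / n_test, eps)
        score += (p_test - p_train) * math.log(p_test / p_train)
    return score
```

Both scores are 0 for identical label distributions, and this uncorrected Cramer’s V reaches 1 when the train and test labels share no categories.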

Parameters
min_category_size_ratio: float, default: 0.01

Minimum size ratio for categories. Categories with size ratio lower than this number are binned into an “Other” category.

max_num_categories_for_drift: int, default: None

Max number of allowed categories. If there are more, they are binned into an “Other” category. This limit applies to both the drift calculation and the distribution plots.
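The binning described by min_category_size_ratio and max_num_categories_for_drift can be sketched as follows. This is a hypothetical helper (`bin_rare_categories` is not part of the library) that works on a single list of labels; how the library chooses the reference sample for the category sizes is not stated here, so treat that as an assumption.

```python
from collections import Counter


def bin_rare_categories(values, min_category_size_ratio=0.01,
                        max_num_categories=None, other_label="Other"):
    """Replace rare or excess categories with a single catch-all label."""
    counts = Counter(values)
    total = len(values)
    # Keep categories whose share of the sample reaches the minimum ratio,
    # ordered from most to least frequent.
    kept = [category for category, count in counts.most_common()
            if count / total >= min_category_size_ratio]
    # Optionally cap the number of categories, keeping the most frequent ones.
    if max_num_categories is not None:
        kept = kept[:max_num_categories]
    kept = set(kept)
    return [value if value in kept else other_label for value in values]
```

For example, with 98 samples of one label and one sample each of two others, a ratio of 0.02 folds the two rare labels into “Other”.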

max_num_categories_for_display: int, default: 10

Max number of categories to show in plot.

show_categories_by: str, default: ‘largest_difference’

Specify which categories to show in the graphs of categorical features, since the number of shown categories is limited by max_num_categories_for_display. Possible values:
- ‘train_largest’: Show the largest train categories.
- ‘test_largest’: Show the largest test categories.
- ‘largest_difference’: Show the categories with the largest difference between train and test.
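As a rough illustration of the three show_categories_by options, the sketch below ranks categories and keeps the top few. The function name `categories_to_show` is hypothetical, and ranking by each category's share of its dataset (rather than by raw counts) is an assumption; the library's plotting code may differ.

```python
from collections import Counter


def categories_to_show(train, test, show_categories_by="largest_difference",
                       max_num_categories_for_display=10):
    """Pick which categories to plot, assuming non-empty train and test."""
    train_counts, test_counts = Counter(train), Counter(test)
    categories = sorted(set(train_counts) | set(test_counts))
    # Share of each category within its own dataset (assumed ranking basis).
    train_share = {c: train_counts[c] / len(train) for c in categories}
    test_share = {c: test_counts[c] / len(test) for c in categories}

    if show_categories_by == "train_largest":
        def rank(c):
            return train_share[c]
    elif show_categories_by == "test_largest":
        def rank(c):
            return test_share[c]
    elif show_categories_by == "largest_difference":
        def rank(c):
            return abs(train_share[c] - test_share[c])
    else:
        raise ValueError(f"Unknown show_categories_by: {show_categories_by!r}")

    ranked = sorted(categories, key=rank, reverse=True)
    return ranked[:max_num_categories_for_display]
```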

numerical_drift_method: str, default: “KS”

Decides which method to use on numerical variables. Possible values are: “EMD” for Earth Mover’s Distance (EMD), “KS” for Kolmogorov-Smirnov (KS).
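For intuition about the two numerical options, here is a minimal pure-Python sketch of both statistics on empirical samples. The function names are hypothetical, both samples are assumed to be non-empty, and the Earth Mover’s Distance shown is the raw, unscaled one-dimensional distance; any scaling the library applies to its score is not reproduced here.

```python
import bisect


def _ecdf_gaps(train, test):
    """Yield (x, |F_train(x) - F_test(x)|) at every distinct observed value."""
    train_sorted, test_sorted = sorted(train), sorted(test)
    for x in sorted(set(train_sorted) | set(test_sorted)):
        f_train = bisect.bisect_right(train_sorted, x) / len(train_sorted)
        f_test = bisect.bisect_right(test_sorted, x) / len(test_sorted)
        yield x, abs(f_train - f_test)


def ks_statistic(train, test):
    """Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    return max(gap for _, gap in _ecdf_gaps(train, test))


def earth_movers_distance(train, test):
    """Unscaled 1-D EMD: area between the two empirical CDFs."""
    gaps = list(_ecdf_gaps(train, test))
    distance = 0.0
    # The ECDF gap is constant between consecutive observed values.
    for (left, gap), (right, _) in zip(gaps, gaps[1:]):
        distance += gap * (right - left)
    return distance
```

The KS statistic is bounded by 1 and only reports the largest gap between the distributions, while the EMD grows with how far the mass has to move, so shifting every value by one unit yields an unscaled EMD of 1.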

categorical_drift_method: str, default: “cramers_v”

Decides which method to use on categorical variables. Possible values are: “cramers_v” for Cramer’s V, “PSI” for Population Stability Index (PSI).

balance_classes: bool, default: False

If True, all categories will have an equal weight in the Cramer’s V score. This is useful when the categorical variable is highly imbalanced, and we want to be alerted on changes in proportion to the category size, and not only to the entire dataset. Must have categorical_drift_method = “cramers_v” and drift_mode = “auto” or “prediction”. If True, the variable frequency plot will be created with a log scale on the y-axis.

ignore_na: bool, default: True

For categorical columns only. If True, ignores None values for categorical drift. If False, treats None as a separate category. For numerical columns, None values are always ignored.

n_samples: int, default: 100_000

Number of samples to use for drift computation and plot.

__init__(max_num_categories_for_drift: Optional[int] = None, min_category_size_ratio: float = 0.01, max_num_categories_for_display: int = 10, show_categories_by: str = 'largest_difference', numerical_drift_method: str = 'KS', categorical_drift_method: str = 'cramers_v', balance_classes: bool = False, ignore_na: bool = True, n_samples: int = 100000, **kwargs)
__new__(*args, **kwargs)

Methods

LabelDrift.add_condition(name, ...)

Add new condition function to the check.

LabelDrift.add_condition_drift_score_less_than([...])

Add condition - require drift score to be less than the threshold.

LabelDrift.clean_conditions()

Remove all conditions from this check instance.

LabelDrift.conditions_decision(result)

Run conditions on given result.

LabelDrift.config([include_version, ...])

Return check configuration (conditions' configuration not yet supported).

LabelDrift.from_config(conf[, version_unmatch])

Return check object from a CheckConfig object.

LabelDrift.from_json(conf[, version_unmatch])

Deserialize check instance from JSON string.

LabelDrift.metadata([with_doc_link])

Return check metadata.

LabelDrift.name()

Name of class in split camel case.

LabelDrift.params([show_defaults])

Return parameters to show when printing the check.

LabelDrift.remove_condition(index)

Remove given condition by index.

LabelDrift.run(train_dataset, test_dataset)

Run check.

LabelDrift.run_logic(context)

Calculate drift for the label.

LabelDrift.to_json([indent, ...])

Serialize check instance to JSON string.