Label Drift#

This notebook provides an overview of how to use and understand the label drift check.

What Is Label Drift?#

Drift is a change in the distribution of data over time, and it is one of the top reasons why a machine learning model’s performance degrades over time.

Label drift is drift that occurs in the distribution of the label itself.

For more information on drift, please visit our Drift Guide.

How Deepchecks Detects Label Drift#

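This check detects label drift by applying univariate drift measures to the label column, comparing its distribution in the train dataset with its distribution in the test dataset.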
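
To make the idea of a univariate measure concrete, the sketch below computes Cramér's V, one common measure for categorical labels, between a train and a test label sample. It is a stand-alone illustration that uses only the standard library, not Deepchecks's internal code. The function name `cramers_v` and the toy label lists are assumptions made for this example, and the measure Deepchecks actually applies depends on its version and configuration.

```python
from collections import Counter
from math import sqrt


def cramers_v(train_labels, test_labels):
    """Cramér's V between dataset membership (train vs. test) and label value.

    0 means the label distributions are identical; 1 means they are disjoint.
    Hypothetical helper for illustration only.
    """
    if not train_labels or not test_labels:
        raise ValueError("both label samples must be non-empty")
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    categories = set(train_counts) | set(test_counts)
    if len(categories) < 2:
        return 0.0  # a single shared category cannot drift
    n_train, n_test = len(train_labels), len(test_labels)
    n_total = n_train + n_test

    # Chi-squared statistic of the 2 x k contingency table (dataset x label).
    chi2 = 0.0
    for category in categories:
        column_total = train_counts[category] + test_counts[category]
        for observed, row_total in (
            (train_counts[category], n_train),
            (test_counts[category], n_test),
        ):
            expected = row_total * column_total / n_total
            chi2 += (observed - expected) ** 2 / expected

    # With two rows, min(rows - 1, columns - 1) is always 1.
    return sqrt(chi2 / n_total)


# Toy labels mirroring the class proportions used in this notebook.
train_labels = [1] * 500 + [0] * 500  # 50% positive
test_labels = [1] * 350 + [0] * 650   # 35% positive
print(f"Cramér's V: {cramers_v(train_labels, test_labels):.3f}")
```

A score near 0 indicates that the label proportions barely changed, while larger values indicate stronger drift.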

import pprint

import numpy as np
import pandas as pd

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import LabelDrift

Run Check on a Classification Label#

# Generate data:
# --------------

np.random.seed(42)

# Two standard-normal features plus a binary label with a 50/50 class split
train_data = np.concatenate(
    [np.random.randn(1000, 2),
     np.random.choice(a=[1, 0], p=[0.5, 0.5], size=(1000, 1))],
    axis=1)
# Create test_data with drift in the label (35/65 class split):
test_data = np.concatenate(
    [np.random.randn(1000, 2),
     np.random.choice(a=[1, 0], p=[0.35, 0.65], size=(1000, 1))],
    axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])

train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')

df_train.head()
       col1      col2  target
0  0.496714 -0.138264     1.0
1  0.647689  1.523030     1.0
2 -0.234153 -0.234137     1.0
3  1.579213  0.767435     1.0
4 -0.469474  0.542560     0.0


Run Check#

LabelDrift().run(train_dataset=train_dataset, test_dataset=test_dataset)
Label Drift


Run Check on a Regression Label#

# Generate data:
# --------------

train_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)
test_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])
# Create drift in test: add positive noise and a linear trend rising from 0 to about 4
df_test['target'] = (df_test['target'].astype('float')
                     + abs(np.random.randn(1000))
                     + np.arange(0, 1, 0.001) * 4)

train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')
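
For a numeric label such as this one, a univariate measure compares the two empirical distributions directly. As a hedged illustration, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical CDFs, using only the standard library. The helper name `ks_statistic` and the toy samples are assumptions for this example. This is not Deepchecks's implementation, and the numeric measure it applies depends on its version and configuration.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|.

    Hypothetical helper for illustration only.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    sorted_a, sorted_b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in sorted_a + sorted_b:
        cdf_a = bisect_right(sorted_a, x) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, x) / len(sorted_b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Toy data generated with the standard library, so the values differ from
# the NumPy data in this notebook even though the shift has the same shape.
rng = random.Random(42)
train_target = [rng.gauss(0, 1) for _ in range(1000)]
test_target = [
    rng.gauss(0, 1) + abs(rng.gauss(0, 1)) + 4 * i / 1000  # noise plus trend
    for i in range(1000)
]
print(f"KS statistic: {ks_statistic(train_target, test_target):.3f}")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap between the two samples).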

Run Check#

LabelDrift().run(train_dataset=train_dataset, test_dataset=test_dataset)
Label Drift


Add a Condition#

check_cond = LabelDrift().add_condition_drift_score_less_than()
check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)
Label Drift
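
A condition turns the drift score into a pass/fail result: it passes when the score stays below a maximum allowed value. The logic can be sketched in plain Python as below. The helper `drift_score_below`, the threshold of 0.15, and the example scores are all assumptions for illustration. The default threshold of `add_condition_drift_score_less_than` depends on the Deepchecks version, so consult its API reference for the actual value.

```python
def drift_score_below(drift_score, max_allowed_drift_score=0.15):
    """Return (passed, details) for a 'drift score less than' style condition.

    Hypothetical helper; 0.15 is an illustrative threshold, not a confirmed default.
    """
    passed = drift_score < max_allowed_drift_score
    verdict = "below" if passed else "not below"
    details = (
        f"Drift score {drift_score:.3f} is {verdict} "
        f"the limit {max_allowed_drift_score:.3f}"
    )
    return passed, details


# Example scores chosen for illustration only.
for score in (0.05, 0.40):
    passed, details = drift_score_below(score)
    print("PASS" if passed else "FAIL", "-", details)
```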

