Train Test Label Drift#

This notebook provides an overview of how to use and understand the label drift check.

Structure:

What Is Label Drift?
Run Check on a Classification Label
Run Check on a Regression Label
Add a Condition

What Is Label Drift?#

Drift is a change in the distribution of data over time. It is also one of the top reasons why a machine learning model's performance degrades.

Label drift is when drift occurs in the label itself.

For more information on drift, please visit our drift guide.

How Deepchecks Detects Label Drift#

This check detects label drift by using univariate measures on the label column.
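As an illustration of what a univariate measure on a categorical label can look like, here is a minimal sketch of the Population Stability Index (PSI), written with the standard library only. Treat it as an assumption about the kind of measure involved, not as the check's actual implementation; the function name `psi` and the `eps` clipping value are hypothetical choices made for this example.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two samples of categorical labels.

    Hypothetical helper for illustration; not part of deepchecks.
    """
    exp_counts = Counter(expected)
    act_counts = Counter(actual)
    score = 0.0
    for category in set(expected) | set(actual):
        # Clip proportions away from zero so the logarithm stays finite.
        p_exp = max(exp_counts[category] / len(expected), eps)
        p_act = max(act_counts[category] / len(actual), eps)
        score += (p_act - p_exp) * math.log(p_act / p_exp)
    return score


# Label proportions mirroring the classification example in this guide:
# 50/50 in train, 35/65 in test.
train_labels = [1] * 500 + [0] * 500
test_labels = [1] * 350 + [0] * 650

psi_score = psi(train_labels, test_labels)
print(round(psi_score, 4))  # 0.0929
```

Identical label distributions give a score of 0, and the score grows as the class proportions move apart.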

import pprint

import numpy as np
import pandas as pd

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import TrainTestLabelDrift

Run Check on a Classification Label#

# Generate data:
# --------------

np.random.seed(42)

train_data = np.concatenate([np.random.randn(1000,2), np.random.choice(a=[1,0], p=[0.5, 0.5], size=(1000, 1))], axis=1)
# Create test_data with drift in label:
test_data = np.concatenate([np.random.randn(1000,2), np.random.choice(a=[1,0], p=[0.35, 0.65], size=(1000, 1))], axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])

train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')

Out:

It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
0 categorical features were inferred

df_train.head()

	col1	col2	target
0	0.496714	-0.138264	1.0
1	0.647689	1.523030	1.0
2	-0.234153	-0.234137	1.0
3	1.579213	0.767435	1.0
4	-0.469474	0.542560	0.0

Run Check#

check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result

Train Test Label Drift

Run Check on a Regression Label#

# Generate data:
# --------------

train_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)
test_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])
# Create drift in test:
df_test['target'] = df_test['target'].astype('float') + abs(np.random.randn(1000)) + np.arange(0, 1, 0.001) * 4

train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')

Out:

It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
0 categorical features were inferred

Run Check#

check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result

Train Test Label Drift
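For a numerical label, one univariate measure that fits this setting is the Earth Mover's Distance (the 1-D Wasserstein distance), which is the area between the two empirical CDFs. The sketch below is a standard-library illustration under that assumption, not the check's own code. The helper name `earth_movers_distance` is hypothetical, and the synthetic data only imitates the shape of the regression example above (it does not reproduce the same random numbers).

```python
import random
from bisect import bisect_right


def earth_movers_distance(xs, ys):
    """1-D Wasserstein distance: integral of |CDF_x - CDF_y| over the real line.

    Hypothetical helper for illustration; not part of deepchecks.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both empirical CDFs are constant on [left, right).
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


rng = random.Random(42)
train_target = [rng.gauss(0, 1) for _ in range(1000)]
# Drifted label: add a positive noise term and a linear trend from 0 to 4.
test_target = [rng.gauss(0, 1) + abs(rng.gauss(0, 1)) + i * 0.004 for i in range(1000)]

emd = earth_movers_distance(train_target, test_target)
print(f"Earth Mover's Distance: {emd:.3f}")
```

The raw distance is in the units of the label. Dividing it by the range of the pooled values is one way to obtain a scale-free score; whether the check normalizes this way is an assumption here.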

Add a Condition#

check_cond = TrainTestLabelDrift().add_condition_drift_score_not_greater_than()
check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)

Train Test Label Drift
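Conceptually, a condition of this kind compares the drift score with a maximum allowed value and reports pass or fail. The following sketch shows that logic in isolation. The helper `check_drift_condition` and both thresholds are hypothetical: 0.2 is a widely used rule of thumb for PSI, and 0.1 is an assumed limit for a range-normalized Earth Mover's Distance. Neither is presented as the library's confirmed default.

```python
# Illustrative thresholds only (assumptions, not confirmed library defaults).
MAX_PSI = 0.2  # common rule of thumb for a significant PSI shift
MAX_EMD = 0.1  # assumed limit for a range-normalized Earth Mover's Distance


def check_drift_condition(score, max_allowed):
    """Hypothetical helper: pass when the score does not exceed the threshold."""
    passed = score <= max_allowed
    status = "PASS" if passed else "FAIL"
    message = f"{status}: drift score {score:.3f} (max allowed {max_allowed:.3f})"
    return passed, message


passed, message = check_drift_condition(0.093, MAX_PSI)
print(message)  # PASS: drift score 0.093 (max allowed 0.200)
```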
