Train Test Prediction Drift#

This notebook provides an overview of using and understanding the tabular prediction drift check.

What Is Prediction Drift?#

Drift is a change in the distribution of data over time, and it is one of the main reasons a machine learning model’s performance degrades over time.

Prediction drift is drift in the model’s predictions themselves. Calculating prediction drift is especially useful when labels are not available for the test dataset, because a drift in the predictions is then our only indication that a change has happened in the data that actually affects model predictions. If labels are available, it’s also recommended to run the Label Drift check.

For more information on drift, please visit our drift guide.

How Deepchecks Detects Prediction Drift#

This check detects prediction drift by using univariate measures on the prediction output.
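To make "univariate measure" concrete, here is a minimal sketch of one commonly used measure for categorical outputs, the Population Stability Index (PSI), applied to predicted classes. It is written in plain Python for illustration only: the function name `population_stability_index` and the `eps` smoothing value are assumptions of this sketch, and it is not necessarily the measure or implementation the check itself uses.

```python
import math
from collections import Counter


def population_stability_index(train_preds, test_preds, eps=1e-6):
    """Illustrative PSI between two lists of predicted classes.

    Hypothetical helper for this sketch; not part of deepchecks.
    `eps` (an assumed value) avoids log(0) for classes missing from one set.
    """
    train_counts = Counter(train_preds)
    test_counts = Counter(test_preds)
    categories = set(train_counts) | set(test_counts)

    score = 0.0
    for category in categories:
        # Share of predictions falling in this class, floored at eps.
        p_train = max(train_counts[category] / len(train_preds), eps)
        p_test = max(test_counts[category] / len(test_preds), eps)
        score += (p_test - p_train) * math.log(p_test / p_train)
    return score


# Identical prediction distributions give a score of zero.
same = population_stability_index(["<=50K", ">50K"] * 50, ["<=50K", ">50K"] * 50)

# A test set dominated by one class gives a clearly positive score.
shifted = population_stability_index(["<=50K", ">50K"] * 50, [">50K"] * 90 + ["<=50K"] * 10)
```

The score depends only on the one-dimensional distribution of the predictions, which is what makes the measure univariate: no features or labels are needed to compute it.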

from sklearn.preprocessing import LabelEncoder

from deepchecks.tabular.checks import TrainTestPredictionDrift
from deepchecks.tabular.datasets.classification import adult

Generate data#

label_name = 'income'
train_ds, test_ds = adult.load_data()

Introduce drift by overwriting the education columns so that every test sample has a Bachelors degree:

test_ds.data['education-num'] = 13
test_ds.data['education'] = ' Bachelors'

Build Model#

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
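from sklearn.preprocessing import OrdinalEncoder

# Fill missing numeric values (SimpleImputer uses the column mean by default).
numeric_transformer = SimpleImputer()
# Fill missing categorical values with the most frequent value, then encode them as integers.
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("encoder", OrdinalEncoder())]
)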

train_ds.features
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, train_ds.numerical_features),
        ("cat", categorical_transformer, train_ds.cat_features),
    ]
)

model = Pipeline(
    steps=[
        ("preprocessing", preprocessor),
        ("model", RandomForestClassifier(max_depth=5, n_jobs=-1)),
    ]
)
model = model.fit(train_ds.data[train_ds.features], train_ds.data[train_ds.label_name])

Run check#

check = TrainTestPredictionDrift()
result = check.run(train_dataset=train_ds, test_dataset=test_ds, model=model)
result

The prediction drift check can also calculate drift on the predicted classes rather than on the predicted probabilities. This is the default behavior for multiclass tasks. To force this behavior for binary tasks, set the drift_mode parameter to 'prediction'.

check = TrainTestPredictionDrift(drift_mode='prediction')
result = check.run(train_dataset=train_ds, test_dataset=test_ds, model=model)
result
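Why the two modes can disagree is easiest to see on a toy example. The sketch below uses made-up probabilities and two hypothetical helpers (`wasserstein_equal_size` and `to_classes`, neither part of deepchecks) to show a case where the predicted probabilities shift while the predicted classes do not. The distance used is the one-dimensional Wasserstein (Earth Mover's) distance for equal-sized samples, chosen here only for illustration.

```python
from collections import Counter


def wasserstein_equal_size(a, b):
    """1D Wasserstein distance between two samples of equal size.

    For equal-sized samples this is the mean absolute difference
    between the sorted values. Hypothetical helper for this sketch.
    """
    if len(a) != len(b):
        raise ValueError("this simplified version needs equal-sized samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def to_classes(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 predictions (assumed 0.5 cutoff)."""
    return [int(p >= threshold) for p in probabilities]


# Made-up positive-class probabilities, for illustration only.
train_proba = [0.1, 0.2, 0.8, 0.9]
test_proba = [0.3, 0.4, 0.6, 0.7]

# The probabilities moved toward 0.5, so the distance is non-zero.
proba_drift = wasserstein_equal_size(train_proba, test_proba)

# The predicted classes are distributed identically in both sets.
train_class_counts = Counter(to_classes(train_proba))
test_class_counts = Counter(to_classes(test_proba))
classes_unchanged = train_class_counts == test_class_counts
```

Here the model became less confident on the test data without changing any of its decisions, so a comparison of probabilities reports drift while a comparison of predicted classes does not.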