.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/distribution/plot_train_test_feature_drift.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_distribution_plot_train_test_feature_drift.py:


Train Test Feature Drift
************************

This notebook provides an overview of using and understanding the feature drift check.

**Structure:**

* `What is a feature drift? <#what-is-a-feature-drift>`__
* `Generate data & model <#generate-data-model>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

What is a feature drift?
========================

Data drift is simply a change in the distribution of data over time. It is
also one of the top reasons why a machine learning model's performance degrades
over time.

Causes of data drift include:

* Upstream process changes, such as a sensor being replaced that changes
  the units of measurement from inches to centimeters.
* Data quality issues, such as a broken sensor always reading 0.
* Natural drift in the data, such as mean temperature changing with the seasons.
* Change in relation between features, or covariate shift.

Feature drift is such drift in a single feature in the dataset.

In the context of machine learning, drift between the training set and the
test set will likely make the model prone to errors. In other words,
the model was trained on data that differs from the current test
data, so it will probably make more mistakes when predicting the target variable.

How deepchecks detects feature drift
------------------------------------

There are many methods to detect feature drift.
Some of them include training a classifier that detects which samples come from
a known distribution and defining the drift by the accuracy of this classifier.
For more information, refer to the :doc:`Whole Dataset Drift check `.

Other approaches include statistical methods that aim to measure the difference
between the distributions of 2 given sets. We experimented with various approaches and found that for
detecting drift in a single feature, the following 2 methods give the best results:

* `Population Stability Index (PSI) `__
* `Wasserstein metric (Earth Movers Distance) `__

For numerical features, the check uses the Earth Movers Distance method
and for the categorical features it uses the PSI.
The check calculates drift between train dataset and test dataset per feature, using these 2 statistical measures.

.. GENERATED FROM PYTHON SOURCE LINES 60-63

Generate data & model
=====================
Let's generate a mock dataset of 2 categorical and 2 numerical features

.. GENERATED FROM PYTHON SOURCE LINES 63-78

.. code-block:: default

    import numpy as np
    import pandas as pd

    # Fix the seed so the generated data (and the outputs below) are reproducible
    np.random.seed(42)

    # Each split has two standard-normal numeric columns and two categorical columns.
    # Concatenating floats with strings yields an array of strings for now.
    train_data = np.concatenate([np.random.randn(1000,2),
                                 np.random.choice(a=['apple', 'orange', 'banana'], p=[0.5, 0.3, 0.2], size=(1000, 2))],
                                axis=1)
    test_data = np.concatenate([np.random.randn(1000,2),
                                np.random.choice(a=['apple', 'orange', 'banana'], p=[0.5, 0.3, 0.2], size=(1000, 2))],
                               axis=1)

    df_train = pd.DataFrame(train_data, columns=['numeric_without_drift', 'numeric_with_drift',
                                                 'categorical_without_drift', 'categorical_with_drift'])
    df_test = pd.DataFrame(test_data, columns=df_train.columns)

    # Convert the numeric columns back from strings to floats
    df_train = df_train.astype({'numeric_without_drift': 'float', 'numeric_with_drift': 'float'})
    df_test = df_test.astype({'numeric_without_drift': 'float', 'numeric_with_drift': 'float'})

.. GENERATED FROM PYTHON SOURCE LINES 79-82

.. code-block:: default

    df_train.head()

.. raw:: html

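
    <table>
      <thead>
        <tr><th></th><th>numeric_without_drift</th><th>numeric_with_drift</th><th>categorical_without_drift</th><th>categorical_with_drift</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>0.496714</td><td>-0.138264</td><td>apple</td><td>apple</td></tr>
        <tr><th>1</th><td>0.647689</td><td>1.523030</td><td>apple</td><td>apple</td></tr>
        <tr><th>2</th><td>-0.234153</td><td>-0.234137</td><td>banana</td><td>banana</td></tr>
        <tr><th>3</th><td>1.579213</td><td>0.767435</td><td>apple</td><td>banana</td></tr>
        <tr><th>4</th><td>-0.469474</td><td>0.542560</td><td>orange</td><td>apple</td></tr>
      </tbody>
    </table>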

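To make the PSI measure mentioned earlier for categorical features more concrete,
here is a minimal sketch in plain Python. It is not deepchecks' implementation:
the function name ``psi`` and the ``eps`` floor for categories missing from one
sample are assumptions made for illustration only.

.. code-block:: python

    import math
    from collections import Counter


    def psi(expected, actual, eps=1e-4):
        """Population Stability Index between two samples of categorical values.

        ``eps`` is an illustrative floor (an assumption of this sketch) that keeps
        a category missing from one sample from causing log(0) or division by zero.
        """
        categories = set(expected) | set(actual)
        expected_counts = Counter(expected)
        actual_counts = Counter(actual)

        total = 0.0
        for category in categories:
            # Share of each sample that falls in this category, floored at eps
            e = max(expected_counts[category] / len(expected), eps)
            a = max(actual_counts[category] / len(actual), eps)
            total += (a - e) * math.log(a / e)
        return total


    train = ['apple'] * 50 + ['orange'] * 30 + ['banana'] * 20
    same = list(train)
    shifted = ['apple'] * 50 + ['orange'] * 25 + ['banana'] * 15 + ['lemon'] * 10

    print(psi(train, same))     # 0.0 for identical distributions
    print(psi(train, shifted))  # a positive score: shares moved and 'lemon' is new

Every term of the sum is non-negative, so the score only grows as the category
shares move apart, and a category that appears in just one sample contributes the most.
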
.. GENERATED FROM PYTHON SOURCE LINES 83-86

Insert drift into test:
-----------------------
Now, we insert a synthetic drift into 2 columns in the dataset

.. GENERATED FROM PYTHON SOURCE LINES 86-90

.. code-block:: default

    # Numeric drift: add positive noise plus a ramp that grows from 0 to about 4 across the rows
    df_test['numeric_with_drift'] = df_test['numeric_with_drift'].astype('float') + abs(np.random.randn(1000)) + np.arange(0, 1, 0.001) * 4
    # Categorical drift: change the category proportions and introduce a new 'lemon' category
    df_test['categorical_with_drift'] = np.random.choice(a=['apple', 'orange', 'banana', 'lemon'], p=[0.5, 0.25, 0.15, 0.1], size=(1000, 1))

.. GENERATED FROM PYTHON SOURCE LINES 91-96

Training a model
----------------
Now, we build a dummy model (the label is just a random numerical column).
We preprocess our synthetic dataset so that categorical features are encoded with an OrdinalEncoder

.. GENERATED FROM PYTHON SOURCE LINES 96-104

.. code-block:: default

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier

    from deepchecks.tabular import Dataset

.. GENERATED FROM PYTHON SOURCE LINES 105-121

.. code-block:: default

    model = Pipeline([
        ('handle_cat', ColumnTransformer(
            transformers=[
                # Numeric columns are passed to the model unchanged
                ('num', 'passthrough',
                 ['numeric_with_drift', 'numeric_without_drift']),
                # Categories unseen during fit (such as 'lemon') are encoded as -1
                ('cat',
                 Pipeline([
                     ('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
                 ]),
                 ['categorical_with_drift', 'categorical_without_drift'])
            ]
        )),
        ('model', DecisionTreeClassifier(random_state=0, max_depth=2))]
    )

.. GENERATED FROM PYTHON SOURCE LINES 122-134

.. code-block:: default

    # Random binary label for the train set
    label = np.random.randint(0, 2, size=(df_train.shape[0],))
    cat_features = ['categorical_without_drift', 'categorical_with_drift']
    df_train['target'] = label
    train_dataset = Dataset(df_train, label='target', cat_features=cat_features)

    model.fit(train_dataset.data[train_dataset.features], label)

    # Random binary label for the test set
    label = np.random.randint(0, 2, size=(df_test.shape[0],))
    df_test['target'] = label
    test_dataset = Dataset(df_test, label='target', cat_features=cat_features)

.. GENERATED FROM PYTHON SOURCE LINES 135-138

Run the check
=============
Let's run deepchecks' feature drift check and see the results

.. GENERATED FROM PYTHON SOURCE LINES 138-145

.. code-block:: default

    from deepchecks.tabular.checks import TrainTestFeatureDrift

    check = TrainTestFeatureDrift()
    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)
    result

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:180: UserWarning:

    Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead

    Calculating permutation feature importance. Expected to finish in 1 seconds

.. raw:: html

    <h4>Train Test Drift</h4>
    <p>Calculate drift between train dataset and test dataset per feature, using statistical measures.</p>
    <h5>Additional Outputs</h5>
    <p>
      The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.<br>
      The check shows the drift score and distributions for the features, sorted by feature importance and showing only the top 5 features, according to feature importance.<br>
      If available, the plot titles also show the feature importance (FI) rank.
    </p>

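To build intuition for the drift score reported for numerical features, the
one-dimensional Earth Mover's Distance can be sketched in plain Python. This is
an illustration only: the function names are hypothetical, the sketch handles
equal-sized samples only, and dividing by the combined value range to obtain a
score between 0 and 1 is an assumption of this sketch rather than a statement
about how deepchecks computes its score.

.. code-block:: python

    def earth_movers_distance(sample_a, sample_b):
        """1-D Earth Mover's Distance for two samples of equal size.

        For equal-sized samples the optimal transport plan pairs the i-th smallest
        value of one sample with the i-th smallest value of the other, so the
        distance is the mean absolute difference between the sorted samples.
        """
        if len(sample_a) != len(sample_b):
            raise ValueError("this sketch only handles samples of equal size")
        pairs = zip(sorted(sample_a), sorted(sample_b))
        return sum(abs(a - b) for a, b in pairs) / len(sample_a)


    def scaled_emd(sample_a, sample_b):
        """EMD divided by the combined value range, so the score lies in [0, 1]."""
        low = min(min(sample_a), min(sample_b))
        high = max(max(sample_a), max(sample_b))
        if high == low:
            return 0.0
        return earth_movers_distance(sample_a, sample_b) / (high - low)


    baseline = [0.0, 1.0, 2.0, 3.0]
    shifted = [v + 2.0 for v in baseline]

    print(earth_movers_distance(baseline, baseline))  # 0.0
    print(earth_movers_distance(baseline, shifted))   # 2.0: every point moved by 2
    print(scaled_emd(baseline, shifted))              # 0.4: 2.0 / (5.0 - 0.0)

A scaled score makes a single threshold meaningful across features that are
measured in different units.
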
.. GENERATED FROM PYTHON SOURCE LINES 146-157

Observe the check's output
--------------------------
As we see from the results, the check detects and returns the drift score
per feature. As expected, the features that were manually manipulated
to contain a strong drift were detected.

In addition to the graphs, each check returns a value on which expectations
can be defined (for example, requiring that the drift score
for every feature be below 0.05).

Let's see the result value for our check

.. GENERATED FROM PYTHON SOURCE LINES 157-160

.. code-block:: default

    result.value

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    OrderedDict([('numeric_without_drift', {'Drift score': 0.019594833552359095, 'Method': "Earth Mover's Distance", 'Importance': 0.6911764705882353}), ('numeric_with_drift', {'Drift score': 0.3430867349314306, 'Method': "Earth Mover's Distance", 'Importance': 0.3088235294117647}), ('categorical_without_drift', {'Drift score': 0.004109630273978716, 'Method': 'PSI', 'Importance': 0.0}), ('categorical_with_drift', {'Drift score': 0.22343755359099068, 'Method': 'PSI', 'Importance': 0.0})])

.. GENERATED FROM PYTHON SOURCE LINES 161-169

Define a condition
==================
As we can see, we get the drift score for each feature in the dataset, along with the feature importance
with respect to the model.

Now, we define a condition that requires each feature's drift score not to exceed a threshold:
0.2 for features scored with PSI and 0.1 for features scored with Earth Mover's Distance.
A condition is deepchecks' way to enforce that results are OK, and we don't have
a problem in our data or model!

.. GENERATED FROM PYTHON SOURCE LINES 169-173

.. code-block:: default

    check_cond = check.add_condition_drift_score_not_greater_than(max_allowed_psi_score=0.2,
                                                                  max_allowed_earth_movers_score=0.1)

.. GENERATED FROM PYTHON SOURCE LINES 174-178

.. code-block:: default

    result = check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)
    result.show(show_additional_outputs=False)

.. raw:: html


    <h4>Train Test Drift</h4>

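The logic of such a condition can be sketched without deepchecks by applying
the same thresholds to the drift scores printed in the ``result.value`` output
above. This is only an illustration of the idea, not deepchecks' internal
implementation; the names ``drift_scores``, ``max_allowed`` and ``failing`` are
hypothetical, and the scores are rounded copies of the values shown above.

.. code-block:: python

    # Drift scores copied (rounded) from the ``result.value`` output shown above
    drift_scores = {
        'numeric_without_drift': {'Drift score': 0.0196, 'Method': "Earth Mover's Distance"},
        'numeric_with_drift': {'Drift score': 0.3431, 'Method': "Earth Mover's Distance"},
        'categorical_without_drift': {'Drift score': 0.0041, 'Method': 'PSI'},
        'categorical_with_drift': {'Drift score': 0.2234, 'Method': 'PSI'},
    }

    # Same per-method thresholds as the condition defined above
    max_allowed = {'PSI': 0.2, "Earth Mover's Distance": 0.1}

    # A feature fails when its score exceeds the threshold of its own method
    failing = [
        feature
        for feature, info in drift_scores.items()
        if info['Drift score'] > max_allowed[info['Method']]
    ]
    print(failing)  # ['numeric_with_drift', 'categorical_with_drift']

Only the two columns that were deliberately drifted exceed their thresholds.
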
.. GENERATED FROM PYTHON SOURCE LINES 179-181

As we see, our condition successfully detects and filters the problematic features that contain drift!


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 0.546 seconds)


.. _sphx_glr_download_checks_gallery_tabular_distribution_plot_train_test_feature_drift.py:


.. only :: html

    .. container:: sphx-glr-footer
        :class: sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_train_test_feature_drift.py `

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_train_test_feature_drift.ipynb `


.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery `_