.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/distribution/plot_train_test_label_drift.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_distribution_plot_train_test_label_drift.py:

Train Test Label Drift
**********************

.. GENERATED FROM PYTHON SOURCE LINES 8-17

.. code-block:: default

    import pprint

    import numpy as np
    import pandas as pd

    from deepchecks.tabular import Dataset
    from deepchecks.tabular.checks import TrainTestLabelDrift

.. GENERATED FROM PYTHON SOURCE LINES 18-20

Generate data - Classification label
====================================

.. GENERATED FROM PYTHON SOURCE LINES 20-33

.. code-block:: default

    np.random.seed(42)

    train_data = np.concatenate([np.random.randn(1000,2), np.random.choice(a=[1,0], p=[0.5, 0.5], size=(1000, 1))], axis=1)
    # Create test_data with drift in label:
    test_data = np.concatenate([np.random.randn(1000,2), np.random.choice(a=[1,0], p=[0.35, 0.65], size=(1000, 1))], axis=1)

    df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
    df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])

    train_dataset = Dataset(df_train, label='target')
    test_dataset = Dataset(df_test, label='target')

.. GENERATED FROM PYTHON SOURCE LINES 34-37

.. code-block:: default

    df_train.head()

.. csv-table::
   :header: "", "col1", "col2", "target"

   "0", "0.496714", "-0.138264", "1.0"
   "1", "0.647689", "1.523030", "1.0"
   "2", "-0.234153", "-0.234137", "1.0"
   "3", "1.579213", "0.767435", "1.0"
   "4", "-0.469474", "0.542560", "0.0"


.. GENERATED FROM PYTHON SOURCE LINES 38-40

Run Check
=========

.. GENERATED FROM PYTHON SOURCE LINES 40-45

.. code-block:: default

    check = TrainTestLabelDrift()
    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
    result

**Train Test Label Drift**

Calculate label drift between train dataset and test dataset, using statistical measures.

**Additional Outputs**

The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.
The check shows the drift score and distributions for the label.
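
To make the drift score concrete for a categorical label, here is a minimal sketch of the Population Stability Index (PSI) in plain Python. It is not the deepchecks implementation. The function name ``psi``, the ``eps`` smoothing term, and the use of the nominal class proportions from the data generation step (0.5/0.5 for train, 0.35/0.65 for test) rather than the sampled ones are all assumptions made for illustration.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two categorical distributions.

    `expected` and `actual` map each category to its proportion.
    `eps` is an assumption of this sketch: it avoids log(0) when a
    category is missing from one of the two distributions.
    """
    categories = set(expected) | set(actual)
    score = 0.0
    for cat in categories:
        e = max(expected.get(cat, 0.0), eps)
        a = max(actual.get(cat, 0.0), eps)
        # Each category contributes (actual - expected) * ln(actual / expected).
        score += (a - e) * math.log(a / e)
    return score


# Nominal label proportions used when generating the data (assumed here,
# the sampled proportions will differ slightly).
train_props = {1: 0.5, 0: 0.5}
test_props = {1: 0.35, 0: 0.65}

drift = psi(train_props, test_props)
print(round(drift, 4))
```

Identical distributions give a score of zero, and the score grows as the class proportions move apart. The value reported by the check itself may differ, since it is computed from the sampled labels.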


.. GENERATED FROM PYTHON SOURCE LINES 46-48

Generate data - Regression label
================================

.. GENERATED FROM PYTHON SOURCE LINES 48-60

.. code-block:: default

    train_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)
    test_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)

    df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
    df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])
    # Create drift in test:
    df_test['target'] = df_test['target'].astype('float') + abs(np.random.randn(1000)) + np.arange(0, 1, 0.001) * 4

    train_dataset = Dataset(df_train, label='target')
    test_dataset = Dataset(df_test, label='target')

.. GENERATED FROM PYTHON SOURCE LINES 61-63

Run Check
=========

.. GENERATED FROM PYTHON SOURCE LINES 63-68

.. code-block:: default

    check = TrainTestLabelDrift()
    result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
    result

**Train Test Label Drift**

Calculate label drift between train dataset and test dataset, using statistical measures.

**Additional Outputs**

The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.
The check shows the drift score and distributions for the label.
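
For a numerical label, a drift score can be based on the Earth Mover's Distance (EMD). The sketch below computes it in plain Python for two equal-size samples, where it reduces to the mean absolute difference between the sorted values. The helper names ``emd_1d`` and ``normalized_emd`` are hypothetical, and dividing by the joint value range is an assumption of this sketch, not a confirmed detail of how deepchecks scales its score.

```python
import random


def emd_1d(xs, ys):
    """Earth Mover's (Wasserstein-1) distance between two equal-size samples.

    For equal-size one-dimensional samples this equals the mean absolute
    difference between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def normalized_emd(xs, ys):
    """EMD divided by the joint value range (a normalization assumed here)."""
    lo = min(min(xs), min(ys))
    hi = max(max(xs), max(ys))
    return emd_1d(xs, ys) / (hi - lo) if hi > lo else 0.0


rng = random.Random(42)
train_label = [rng.gauss(0, 1) for _ in range(1000)]
# Mimic the drift above: add a half-normal term and a linear ramp from 0 to 4.
test_label = [
    rng.gauss(0, 1) + abs(rng.gauss(0, 1)) + 4 * i / 1000 for i in range(1000)
]

raw = emd_1d(train_label, test_label)
scaled = normalized_emd(train_label, test_label)
print(raw, scaled)
```

The raw distance is in the units of the label, while the scaled value always lies between 0 and 1, which makes it easier to compare against a fixed threshold.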


.. GENERATED FROM PYTHON SOURCE LINES 69-70

Add condition
=============

.. GENERATED FROM PYTHON SOURCE LINES 70-73

.. code-block:: default

    check_cond = TrainTestLabelDrift().add_condition_drift_score_not_greater_than()
    check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)

**Train Test Label Drift**

Calculate label drift between train dataset and test dataset, using statistical measures.

**Conditions Summary**

.. csv-table::
   :header: "Status", "Condition", "More Info"

   "", "PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift", "Label's Earth Mover's Distance above threshold: 0.34"

**Additional Outputs**

The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.
The check shows the drift score and distributions for the label.
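
The condition logic shown in the summary can be sketched as a simple threshold comparison. The helper ``check_label_drift`` below is hypothetical and is not part of deepchecks. It only reuses the thresholds (0.2 for PSI, 0.1 for Earth Mover's Distance) and the message format that appear in the Conditions Summary above.

```python
def check_label_drift(method, score, max_psi=0.2, max_emd=0.1):
    """Return (passed, message) for a label drift score.

    Hypothetical helper: thresholds and message wording are copied from
    the Conditions Summary above, not from the deepchecks source code.
    """
    thresholds = {"PSI": max_psi, "Earth Mover's Distance": max_emd}
    if method not in thresholds:
        raise ValueError(f"unknown drift method: {method}")
    if score > thresholds[method]:
        return False, f"Label's {method} above threshold: {score:.2f}"
    return True, ""


# The regression example above reported an Earth Mover's Distance of 0.34.
passed, info = check_label_drift("Earth Mover's Distance", 0.34)
print(passed, info)
```

Because 0.34 exceeds the 0.1 limit, the sketch reports a failure with the same message as the summary row.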


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.134 seconds)