Whole Dataset Drift

.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/train_test_validation/plot_whole_dataset_drift.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_train_test_validation_plot_whole_dataset_drift.py:

Whole Dataset Drift
*******************

This notebook provides an overview of using and understanding the whole dataset drift check.

**Structure:**

* `What Is Multivariate Drift? <#what-is-multivariate-drift>`__
* `Loading the Data <#loading-the-data>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

What Is Multivariate Drift?
==============================

Drift is a change in the distribution of data over time, and it is one of the
top reasons why a machine learning model's performance degrades.

Multivariate drift is drift that occurs in more than one feature at a time. It
may even affect the relationships between those features, a change that
univariate drift methods cannot detect. The whole dataset drift check tries to
detect multivariate drift between the two input datasets.

For more information on drift, please visit our :doc:`drift guide `.

How Deepchecks Detects Dataset Drift
------------------------------------

This check detects multivariate drift by using :ref:`a domain classifier `.
Other methods to detect drift include :ref:`univariate measures `,
which are used in other checks, such as the :doc:`Train Test Feature Drift check `.

.. GENERATED FROM PYTHON SOURCE LINES 39-45

Loading the Data
================

The dataset is the adult dataset, which can be downloaded from the UCI machine learning repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.

.. GENERATED FROM PYTHON SOURCE LINES 45-55

.. code-block:: default

    from urllib.request import urlopen

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    from deepchecks.tabular import Dataset
    from deepchecks.tabular.datasets.classification import adult

.. GENERATED FROM PYTHON SOURCE LINES 56-58

Create Dataset
==============

.. GENERATED FROM PYTHON SOURCE LINES 58-65

.. code-block:: default

    label_name = 'income'
    train_ds, test_ds = adult.load_data()

    # Encode the string label as integers; fit on train, reuse the mapping on test.
    encoder = LabelEncoder()
    train_ds.data[label_name] = encoder.fit_transform(train_ds.data[label_name])
    test_ds.data[label_name] = encoder.transform(test_ds.data[label_name])

.. GENERATED FROM PYTHON SOURCE LINES 66-69

.. code-block:: default

    train_ds.label_name

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    'income'

.. GENERATED FROM PYTHON SOURCE LINES 70-72

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 72-77

.. code-block:: default

    from deepchecks.tabular.checks import WholeDatasetDrift

    check = WholeDatasetDrift()
    check.run(train_dataset=train_ds, test_dataset=test_ds)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Calculating permutation feature importance. Expected to finish in 4 seconds

[Check output display: Whole Dataset Drift]
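
To make the domain classifier idea mentioned earlier more concrete, here is a minimal sketch written with scikit-learn rather than deepchecks. It is not the library's implementation. The function name ``domain_classifier_drift``, the choice of a random forest, the 70/30 split, and the ``max(2 * AUC - 1, 0)`` rescaling are all assumptions made for illustration. The idea is to label each row by the dataset it came from and check how well a classifier can tell the two apart.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def domain_classifier_drift(train_X, test_X, seed=0):
    """Hypothetical helper: return (auc, drift_score) for two numeric matrices."""
    # Label every row by its origin: 0 = train, 1 = test.
    X = np.vstack([train_X, test_X])
    y = np.concatenate([np.zeros(len(train_X)), np.ones(len(test_X))])

    # Hold out part of the rows so the AUC is measured on unseen data.
    X_fit, X_eval, y_fit, y_eval = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_fit, y_fit)
    auc = roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])

    # Assumed rescaling: AUC 0.5 (indistinguishable) -> 0, AUC 1.0 -> 1.
    return auc, max(2 * auc - 1, 0.0)


rng = np.random.default_rng(0)

# Case 1: both samples come from the same distribution.
same_a = rng.normal(size=(2000, 2))
same_b = rng.normal(size=(2000, 2))
_, drift_same = domain_classifier_drift(same_a, same_b)

# Case 2: each feature still has a standard normal marginal, but the two
# features are strongly correlated in the second sample.
cov = [[1.0, 0.9], [0.9, 1.0]]
correlated = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
_, drift_corr = domain_classifier_drift(same_a, correlated)

print(f"same distribution: {drift_same:.3f}")
print(f"changed correlation: {drift_corr:.3f}")
```

In the second case each feature looks unchanged on its own, so a per-feature comparison would see little, while the classifier can exploit the changed relationship between the features.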

.. GENERATED FROM PYTHON SOURCE LINES 78-86

We can see that almost no drift is found between the train and the test sets
of the raw adult dataset. In addition to the drift score, the check displays
the top features that contributed to the data drift.

Introduce drift to dataset
==========================

Now, let's add drift to the data manually by sampling a biased portion of the training data.

.. GENERATED FROM PYTHON SOURCE LINES 86-90

.. code-block:: default

    sample_size = 10000
    random_seed = 0

.. GENERATED FROM PYTHON SOURCE LINES 91-99

.. code-block:: default

    # Half of the train sample is drawn at random, the other half only from rows
    # where 'sex' is ' Female', which biases the training data on that feature.
    train_drifted_df = pd.concat([train_ds.data.sample(min(sample_size, train_ds.n_samples) - 5000, random_state=random_seed),
                                 train_ds.data[train_ds.data['sex'] == ' Female'].sample(5000, random_state=random_seed)])
    test_drifted_df = test_ds.data.sample(min(sample_size, test_ds.n_samples), random_state=random_seed)

    train_drifted_ds = Dataset(train_drifted_df, label=label_name, cat_features=train_ds.cat_features)
    test_drifted_ds = Dataset(test_drifted_df, label=label_name, cat_features=test_ds.cat_features)

.. GENERATED FROM PYTHON SOURCE LINES 100-104

.. code-block:: default

    check = WholeDatasetDrift()
    check.run(train_dataset=train_drifted_ds, test_dataset=test_drifted_ds)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Calculating permutation feature importance. Expected to finish in 5 seconds

[Check output display: Whole Dataset Drift]
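
The effect of the biased sampling on the ``sex`` column can be checked directly with pandas. The sketch below uses a small synthetic frame as a stand-in for the training data. The frame ``df``, its column values, and the 67/33 split are made up for illustration and are not taken from the adult dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the training frame (not the real adult data).
df = pd.DataFrame({
    'sex': rng.choice([' Male', ' Female'], size=20000, p=[0.67, 0.33]),
    'age': rng.integers(18, 80, size=20000),
})

# Same recipe as the drifted train set: 5000 random rows plus 5000 rows
# drawn only from the ' Female' subgroup.
random_part = df.sample(5000, random_state=0)
female_part = df[df['sex'] == ' Female'].sample(5000, random_state=0)
drifted = pd.concat([random_part, female_part])

# Share of ' Female' rows before and after the biased sampling.
base_share = (df['sex'] == ' Female').mean()
drifted_share = (drifted['sex'] == ' Female').mean()

print(f"base share: {base_share:.2f}")
print(f"drifted share: {drifted_share:.2f}")
```

Because half of the drifted sample is drawn from a single subgroup, that subgroup's share rises sharply, which is the kind of shift a drift check is expected to pick up.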

.. GENERATED FROM PYTHON SOURCE LINES 105-115

As expected, the check detects multivariate drift between the train and the
test sets. It also displays the distribution of the sex feature, the feature
that contributed the most to that drift. This is reasonable, since the
sampling was biased on that feature.

Define a Condition
==================

Now we define a condition that enforces that the whole dataset drift score is
not greater than 0.1. A condition is deepchecks' way to validate model and
data quality, and it lets you know if anything goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 115-120

.. code-block:: default

    check = WholeDatasetDrift()
    check.add_condition_overall_drift_value_not_greater_than(0.1)
    check.run(train_dataset=train_drifted_ds, test_dataset=test_drifted_ds)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Calculating permutation feature importance. Expected to finish in 4 seconds

[Check output display: Whole Dataset Drift]
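
Conceptually, a condition of this kind is a pass/fail rule applied to the value a check computes. The sketch below illustrates that idea in plain Python. The names ``ConditionOutcome`` and ``drift_not_greater_than`` are hypothetical and are not part of the deepchecks API. The scores passed in are arbitrary example values, not results from the run above.

```python
from dataclasses import dataclass


@dataclass
class ConditionOutcome:
    """Hypothetical container for the result of a threshold condition."""
    passed: bool
    details: str


def drift_not_greater_than(drift_score: float, max_drift: float = 0.1) -> ConditionOutcome:
    """Pass when the drift score does not exceed the allowed maximum."""
    passed = drift_score <= max_drift
    relation = "<=" if passed else ">"
    details = f"drift score {drift_score:.2f} {relation} threshold {max_drift:.2f}"
    return ConditionOutcome(passed=passed, details=details)


# Arbitrary example scores, one on each side of the threshold.
low = drift_not_greater_than(0.05)
high = drift_not_greater_than(0.35)
print(low)
print(high)
```

A score equal to the threshold passes in this sketch, which matches the "not greater than" wording of the condition name.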

.. GENERATED FROM PYTHON SOURCE LINES 121-122

As we can see, our condition successfully detects that the drift score is above the defined threshold.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  10.476 seconds)