Multivariate Drift

.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/train_test_validation/plot_multivariate_drift.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_train_test_validation_plot_multivariate_drift.py:

.. _plot_tabular_multivariate_drift:

Multivariate Drift
*******************

This notebook provides an overview of how to use and interpret the multivariate drift check.

**Structure:**

* `What Is Multivariate Drift? <#what-is-multivariate-drift>`__
* `Loading the Data <#loading-the-data>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

What Is Multivariate Drift?
==============================

Drift is a change in the distribution of data over time, and it is one of the main
reasons a machine learning model's performance degrades over time.

Multivariate drift is drift that occurs in more than one feature at a time. It may even
affect the relationships between those features, a kind of change that univariate drift
methods cannot detect. The multivariate drift check tries to detect multivariate drift
between the two input datasets.

For more information on drift, please visit our :doc:`drift guide `.

How Deepchecks Detects Dataset Drift
------------------------------------

This check detects multivariate drift by using :ref:`a domain classifier `.
Other methods to detect drift include :ref:`univariate measures `,
which are used in other checks, such as the :doc:`Train Test Feature Drift check `.

.. GENERATED FROM PYTHON SOURCE LINES 41-47

Loading the Data
================

The dataset is the adult dataset, which can be downloaded from the UCI machine learning repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.

.. GENERATED FROM PYTHON SOURCE LINES 47-57

.. code-block:: default

    from urllib.request import urlopen

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    from deepchecks.tabular import Dataset
    from deepchecks.tabular.datasets.classification import adult

.. GENERATED FROM PYTHON SOURCE LINES 58-60

Create Dataset
==============

.. GENERATED FROM PYTHON SOURCE LINES 60-67

.. code-block:: default

    label_name = 'income'
    train_ds, test_ds = adult.load_data()

    # Encode the string labels as integers; fit on train and reuse the mapping for test.
    encoder = LabelEncoder()
    train_ds.data[label_name] = encoder.fit_transform(train_ds.data[label_name])
    test_ds.data[label_name] = encoder.transform(test_ds.data[label_name])

.. GENERATED FROM PYTHON SOURCE LINES 68-71

.. code-block:: default

    train_ds.label_name

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    'income'

.. GENERATED FROM PYTHON SOURCE LINES 72-74

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 74-79

.. code-block:: default

    from deepchecks.tabular.checks import MultivariateDrift

    check = MultivariateDrift()
    check.run(train_dataset=train_ds, test_dataset=test_ds)

[Check output: Multivariate Drift]
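To make the domain-classifier idea mentioned earlier more concrete, here is a minimal sketch built on scikit-learn and synthetic data. It is not the deepchecks implementation: the helper name ``domain_classifier_drift_score``, the choice of a random forest, and the rescaling of AUC to ``max(2 * AUC - 1, 0)`` are all assumptions made for illustration. The sketch labels each row by the dataset it came from, trains a classifier to tell the two apart, and treats a better-than-chance AUC as evidence of drift. The drifted sample keeps both marginals at N(0, 1) and changes only the correlation, the kind of shift a per-feature comparison would miss.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def domain_classifier_drift_score(train_X, test_X, seed=0):
    """Hypothetical helper: 0 means the two sets are indistinguishable, 1 fully separable."""
    # Label every row by its origin: 0 = train, 1 = test.
    X = np.vstack([train_X, test_X])
    origin = np.concatenate([np.zeros(len(train_X)), np.ones(len(test_X))])

    # Hold out half of the rows so the AUC is measured on unseen data.
    X_fit, X_eval, y_fit, y_eval = train_test_split(
        X, origin, test_size=0.5, stratify=origin, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=seed)
    clf.fit(X_fit, y_fit)
    auc = roc_auc_score(y_eval, clf.predict_proba(X_eval)[:, 1])

    # Assumed rescaling: AUC 0.5 (chance) maps to 0, AUC 1.0 maps to 1.
    return max(2 * auc - 1, 0.0)


rng = np.random.default_rng(0)
n = 2000
independent = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[1.0, 0.9], [0.9, 1.0]]

train_X = rng.multivariate_normal([0, 0], independent, size=n)
same_X = rng.multivariate_normal([0, 0], independent, size=n)
# Each column is still N(0, 1); only the relationship between the columns changes.
drifted_X = rng.multivariate_normal([0, 0], correlated, size=n)

no_drift_score = domain_classifier_drift_score(train_X, same_X)
drift_score = domain_classifier_drift_score(train_X, drifted_X)
print(f"same distribution: {no_drift_score:.2f}, changed correlation: {drift_score:.2f}")
```

A tree-based model is used here because a linear classifier cannot separate two samples that differ only in their correlation structure.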

.. GENERATED FROM PYTHON SOURCE LINES 80-88

The check finds almost no drift between the train and test sets of the raw adult dataset.
In addition to the drift score, the check displays the top features that contributed to the data drift.

Introduce drift to dataset
==========================

Now, let's manually add data drift by sampling a biased portion of the training data.

.. GENERATED FROM PYTHON SOURCE LINES 88-92

.. code-block:: default

    sample_size = 10000
    random_seed = 0

.. GENERATED FROM PYTHON SOURCE LINES 93-101

.. code-block:: default

    # Combine a uniform sample with 5000 rows drawn only from ' Female' records
    # (the value is matched with its leading space), which skews the 'sex' feature.
    train_drifted_df = pd.concat([
        train_ds.data.sample(min(sample_size, train_ds.n_samples) - 5000, random_state=random_seed),
        train_ds.data[train_ds.data['sex'] == ' Female'].sample(5000, random_state=random_seed),
    ])
    # The test set is sampled uniformly, so it keeps its original distribution.
    test_drifted_df = test_ds.data.sample(min(sample_size, test_ds.n_samples), random_state=random_seed)

    train_drifted_ds = Dataset(train_drifted_df, label=label_name, cat_features=train_ds.cat_features)
    test_drifted_ds = Dataset(test_drifted_df, label=label_name, cat_features=test_ds.cat_features)

.. GENERATED FROM PYTHON SOURCE LINES 102-106

.. code-block:: default

    check = MultivariateDrift()
    check.run(train_dataset=train_drifted_ds, test_dataset=test_drifted_ds)

[Check output: Multivariate Drift]
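The effect of the biased sampling used to build the drifted train set can be seen on a small toy frame. This is a sketch under stated assumptions: the ``toy_df`` frame, its ``hours`` column, and the 70/30 split are invented for illustration and are not taken from the adult dataset. The same ``pd.concat`` pattern (a uniform sample plus a sample restricted to one category) pushes that category's share well above its original value.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values): 30% of rows are ' Female'.
toy_df = pd.DataFrame({
    "sex": [" Male"] * 700 + [" Female"] * 300,
    "hours": range(1000),
})

n_total, n_biased = 400, 200
biased_df = pd.concat([
    # Uniform part: keeps roughly the original 30% share.
    toy_df.sample(n_total - n_biased, random_state=0),
    # Biased part: drawn only from one category.
    toy_df[toy_df["sex"] == " Female"].sample(n_biased, random_state=0),
])

original_share = (toy_df["sex"] == " Female").mean()
biased_share = (biased_df["sex"] == " Female").mean()
print(f"original share: {original_share:.2f}, biased share: {biased_share:.2f}")
```

Because the two parts are sampled independently, the same row can appear in both, so the combined frame may contain duplicates.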

.. GENERATED FROM PYTHON SOURCE LINES 107-117

As expected, the check detects multivariate drift between the train and test sets.
It also displays the distribution of the sex feature, the feature that contributed the most to that drift.
This is reasonable, since the sampling was biased on that feature.

Define a Condition
==================

Now we define a condition that enforces that the multivariate drift score must be below 0.1.
A condition is deepchecks' way to validate model and data quality and to let you know if anything goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 117-122

.. code-block:: default

    check = MultivariateDrift()
    # Fail the check when the overall drift score is 0.1 or higher.
    check.add_condition_overall_drift_value_less_than(0.1)
    check.run(train_dataset=train_drifted_ds, test_dataset=test_drifted_ds)

[Check output: Multivariate Drift]
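Conceptually, a condition like the one defined above is a pass/fail rule applied to the check's result. The sketch below shows that logic in plain Python. It is not deepchecks code: ``ConditionOutcome``, ``drift_value_less_than``, the strict comparison, and the two example scores are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class ConditionOutcome:
    """Hypothetical result type; not part of deepchecks."""
    passed: bool
    details: str


def drift_value_less_than(drift_score: float, max_allowed: float = 0.1) -> ConditionOutcome:
    """Pass only when the drift score is strictly below the threshold (assumed semantics)."""
    passed = drift_score < max_allowed
    relation = "below" if passed else "not below"
    return ConditionOutcome(passed, f"drift score {drift_score:.2f} is {relation} {max_allowed}")


# Made-up scores: one from a nearly drift-free pair of datasets, one from a drifted pair.
print(drift_value_less_than(0.02))
print(drift_value_less_than(0.45))
```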

.. GENERATED FROM PYTHON SOURCE LINES 123-124

The condition correctly flags that the drift score is above the defined threshold.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 5.468 seconds)

.. _sphx_glr_download_checks_gallery_tabular_train_test_validation_plot_multivariate_drift.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_multivariate_drift.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_multivariate_drift.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_