.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/distribution/plot_train_test_prediction_drift.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_checks_gallery_tabular_distribution_plot_train_test_prediction_drift.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_distribution_plot_train_test_prediction_drift.py:

Train Test Prediction Drift
***************************

This notebook provides an overview of using and understanding the tabular prediction drift check.

**Structure:**

* `What is prediction drift? <#what-is-prediction-drift>`__
* `Generate Data <#generate-data>`__
* `Build Model <#build-model>`__
* `Run check <#run-check>`__

What Is Prediction Drift?
===========================
The term drift (and all its derivatives) is used to describe any change in the data compared
to the data the model was trained on. Prediction drift refers to the case in which a change
in the data (data/feature drift) has happened and, as a result, the distribution of the
model's predictions has changed.

Calculating prediction drift is especially useful when labels are not available for the test
dataset, so a drift in the predictions is our only indication that a change has happened in
the data that actually affects model predictions. If labels are available, it's also
recommended to run the Label Drift Check.

There are two main causes for prediction drift:

* A change in the sample population. In this case, the underlying phenomenon we're trying
  to predict behaves the same, but we're not getting the same types of samples. For example,
  Iris Virginica stops growing and is not being predicted by the model trained to classify
  Iris species.
* Concept drift, which means that the underlying relation between the data and the label has changed.
  For example, we're trying to predict income based on food spending, but ongoing inflation affects prices.
  It's important to note that concept drift won't necessarily result in prediction drift,
  unless it affects features that are of high importance to the model.

How Does the TrainTestPredictionDrift Check Work?
=================================================
There are many methods for detecting drift, most of which rely on statistical measures of the
difference between two distributions. We experimented with various approaches and found that
for detecting drift between two one-dimensional distributions, the following two methods give
the best results:

* For regression problems (numerical predictions), the Wasserstein Distance (Earth Mover's Distance)
* For classification problems (categorical predictions), the Population Stability Index (PSI)

.. GENERATED FROM PYTHON SOURCE LINES 52-59

.. code-block:: default


    from sklearn.preprocessing import LabelEncoder

    from deepchecks.tabular.checks import TrainTestPredictionDrift
    from deepchecks.tabular.datasets.classification import adult


.. GENERATED FROM PYTHON SOURCE LINES 60-62

Generate data
=============

.. GENERATED FROM PYTHON SOURCE LINES 62-69

.. code-block:: default


    label_name = 'income'
    train_ds, test_ds = adult.load_data()

    # Encode the string labels as integers; fit on train and reuse the mapping on test.
    encoder = LabelEncoder()
    train_ds.data[label_name] = encoder.fit_transform(train_ds.data[label_name])
    test_ds.data[label_name] = encoder.transform(test_ds.data[label_name])


.. GENERATED FROM PYTHON SOURCE LINES 70-71

Introducing drift:

.. GENERATED FROM PYTHON SOURCE LINES 71-76

.. code-block:: default


    # Give every test sample the same education level, so the test data no longer
    # resembles the data the model will be trained on.
    test_ds.data['education-num'] = 13
    test_ds.data['education'] = ' Bachelors'


.. GENERATED FROM PYTHON SOURCE LINES 77-79

Build Model
===========

.. GENERATED FROM PYTHON SOURCE LINES 79-87

.. code-block:: default


    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OrdinalEncoder


.. GENERATED FROM PYTHON SOURCE LINES 88-106

.. code-block:: default


    # Numeric features: impute missing values (mean by default).
    numeric_transformer = SimpleImputer()
    # Categorical features: impute with the most frequent value, then ordinal-encode.
    categorical_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("encoder", OrdinalEncoder())]
    )

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, train_ds.numerical_features),
            ("cat", categorical_transformer, train_ds.cat_features),
        ]
    )

    # Train on the original (non-drifted) training data only.
    model = Pipeline(steps=[("preprocessing", preprocessor), ("model", RandomForestClassifier(max_depth=5, n_jobs=-1))])
    model = model.fit(train_ds.data[train_ds.features], train_ds.data[train_ds.label_name])


.. GENERATED FROM PYTHON SOURCE LINES 107-109

Run check
=========

.. GENERATED FROM PYTHON SOURCE LINES 109-113

.. code-block:: default


    check = TrainTestPredictionDrift()
    result = check.run(train_dataset=train_ds, test_dataset=test_ds, model=model)
    result


.. raw:: html

    <h4>Train Test Prediction Drift</h4>
    <p>Calculate prediction drift between train dataset and test dataset, using statistical measures.</p>
    <h5>Additional Outputs</h5>
    <p>The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.</p>
    <p>The check shows the drift score and distributions for the predictions.</p>

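The drift score reported above summarizes how far the distribution of test predictions is
from the distribution of train predictions. As a rough illustration of the two measures named
earlier, the sketch below computes a PSI and a Wasserstein distance on small made-up
prediction arrays. The ``psi`` helper, its ``eps`` clipping value, and the toy arrays are
assumptions made for this example; this is not the check's internal implementation, which
may bin, clip, or scale values differently.

.. code-block:: python

    import numpy as np
    from scipy.stats import wasserstein_distance


    def psi(expected, actual, eps=1e-4):
        """Population Stability Index between two samples of categorical values.

        Hypothetical helper for illustration only.
        """
        expected = np.asarray(expected)
        actual = np.asarray(actual)
        categories = np.union1d(expected, actual)
        # Share of samples falling in each category, for each dataset.
        exp_pct = np.array([np.mean(expected == c) for c in categories])
        act_pct = np.array([np.mean(actual == c) for c in categories])
        # Clip empty categories to a small value to avoid log(0) and division by zero.
        exp_pct = np.clip(exp_pct, eps, None)
        act_pct = np.clip(act_pct, eps, None)
        return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


    # Toy class predictions: the share of class 1 grows from 25% to 75%.
    train_pred = np.array([0, 0, 0, 1])
    test_pred = np.array([0, 1, 1, 1])
    psi_score = psi(train_pred, test_pred)

    # Toy numerical predictions: every test value is shifted by one unit.
    emd_score = wasserstein_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])

    print(f"PSI: {psi_score:.4f}, Wasserstein distance: {emd_score:.4f}")

With both measures, identical samples give a score of 0 and larger values indicate a bigger
shift. The Wasserstein distance is expressed in the units of the predicted value, while PSI
is unitless.
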
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 4.451 seconds)


.. _sphx_glr_download_checks_gallery_tabular_distribution_plot_train_test_prediction_drift.py:


.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_train_test_prediction_drift.py <plot_train_test_prediction_drift.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_train_test_prediction_drift.ipynb <plot_train_test_prediction_drift.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_