.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/train_test_validation/plot_train_test_samples_mix.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_train_test_validation_plot_train_test_samples_mix.py:

.. _plot_tabular_train_test_samples_mix:

Train Test Samples Mix
**********************
This notebook provides an overview of how to use and understand the Train Test Samples Mix check.

**Structure:**

* `Why is samples mix unwanted? <#why-is-samples-mix-unwanted>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

Why is samples mix unwanted?
=============================
Samples mix occurs when the train and test datasets have samples in common.
We use the test dataset to evaluate model performance, so any test sample that
also appears in the train dataset biases the metrics: they no longer represent
the performance we would get in a real scenario. Therefore, we always want to
avoid samples mix.

Run the check
=============
We will run the check on the iris dataset.

.. GENERATED FROM PYTHON SOURCE LINES 26-40

.. code-block:: default

    import pandas as pd

    from deepchecks.tabular import Dataset
    from deepchecks.tabular.checks import TrainTestSamplesMix
    from deepchecks.tabular.datasets.classification import iris

    # Create data with leakage from train to test by copying a few train rows
    # (some of them more than once) into the test data
    train, test = iris.load_data()
    leaked_rows = train.data.iloc[[0, 1, 1, 2, 3, 4, 2, 2, 10]]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    bad_test_df = pd.concat([test.data, leaked_rows], ignore_index=True)
    bad_test = test.copy(bad_test_df)

    check = TrainTestSamplesMix()
    result = check.run(test_dataset=bad_test, train_dataset=train)
    result

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    classification_label value for label type is deprecated, allowed task types are multiclass, binary and regression.

.. raw:: html

    Train Test Samples Mix
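
As a rough illustration of the quantity discussed above, the share of test
samples that also appear in train can be sketched in plain Python. This is a
minimal sketch under stated assumptions, not the deepchecks implementation:
the helper name ``samples_mix_ratio`` and the toy rows are hypothetical, and
we assume every test row counts separately, including repeated copies.

.. code-block:: python

    def samples_mix_ratio(train_rows, test_rows):
        """Return the fraction of test rows that also occur in train.

        Rows are hashable tuples. Hypothetical helper for illustration only.
        """
        if not test_rows:
            return 0.0
        train_set = set(train_rows)
        # Count every test row found in train, including repeated copies
        shared = sum(1 for row in test_rows if row in train_set)
        return shared / len(test_rows)


    # Toy data (made up for this sketch): one train row leaks into test twice
    train_rows = [(5.1, 3.5, "setosa"), (4.9, 3.0, "setosa"), (6.2, 2.9, "versicolor")]
    test_rows = [
        (5.9, 3.0, "virginica"),
        (5.1, 3.5, "setosa"),
        (5.1, 3.5, "setosa"),
        (6.7, 3.1, "versicolor"),
    ]

    ratio = samples_mix_ratio(train_rows, test_rows)
    print(ratio)  # 0.5: two of the four test rows also appear in train

Comparing whole rows as tuples keeps the sketch simple; the check itself may
count or compare samples differently.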


.. GENERATED FROM PYTHON SOURCE LINES 41-45

Define a condition
==================
We can define a condition that enforces that the ratio of test samples that
also appear in train does not exceed a given threshold; the default is ``0.1``.

.. GENERATED FROM PYTHON SOURCE LINES 45-48

.. code-block:: default

    # Add the condition with its default threshold and run the check again
    check = TrainTestSamplesMix().add_condition_duplicates_ratio_less_or_equal()
    result = check.run(test_dataset=bad_test, train_dataset=train)
    result.show(show_additional_outputs=False)

.. raw:: html

    Train Test Samples Mix
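
The pass/fail logic of such a condition can be sketched in plain Python. This
is only an illustration: ``duplicates_ratio_passes`` is a hypothetical helper,
not part of deepchecks. The ``0.1`` default comes from the text above, and the
inclusive comparison is an assumption based on the ``less_or_equal`` part of
the method name.

.. code-block:: python

    def duplicates_ratio_passes(mix_ratio, max_ratio=0.1):
        """Return True when the samples mix ratio is within the allowed threshold.

        Hypothetical helper; max_ratio=0.1 mirrors the default stated in the text.
        """
        if not 0.0 <= mix_ratio <= 1.0:
            raise ValueError("mix_ratio must be between 0 and 1")
        # Inclusive comparison: a ratio exactly at the threshold still passes
        return mix_ratio <= max_ratio


    print(duplicates_ratio_passes(0.05))  # True: below the threshold
    print(duplicates_ratio_passes(0.10))  # True: exactly at the threshold
    print(duplicates_ratio_passes(0.25))  # False: too many shared samples

A stricter project could pass a smaller ``max_ratio``, for example ``0.0`` to
reject any overlap at all.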


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.804 seconds)

.. _sphx_glr_download_checks_gallery_tabular_train_test_validation_plot_train_test_samples_mix.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_train_test_samples_mix.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_train_test_samples_mix.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_