.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/methodology/plot_train_test_samples_mix.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_methodology_plot_train_test_samples_mix.py:

Train Test Samples Mix
**********************

This notebook provides an overview of how to use and interpret the Train Test Samples Mix check.

**Structure:**

* `Why is samples mix unwanted? <#why-is-samples-mix-unwanted>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

Why is samples mix unwanted?
============================

Samples mix occurs when the train and test datasets share some samples.
The test dataset is used to evaluate model performance, so samples it shares with the train dataset bias the metrics: the model has already seen those samples, and the scores no longer represent the performance we would get in a real scenario.
Therefore, we always want to avoid samples mix.

Run the check
=============

We will run the check on the iris dataset.

.. GENERATED FROM PYTHON SOURCE LINES 24-38

.. code-block:: default

    import pandas as pd

    from deepchecks.tabular import Dataset
    from deepchecks.tabular.checks.methodology import TrainTestSamplesMix
    from deepchecks.tabular.datasets.classification import iris

    # Create data with leakage from train to test
    train, test = iris.load_data()
    # Append some train rows (with repeats) to the test data.
    # pd.concat replaces the deprecated DataFrame.append.
    leaked_rows = train.data.iloc[[0, 1, 1, 2, 3, 4, 2, 2, 10]]
    bad_test_df = pd.concat([test.data, leaked_rows], ignore_index=True)
    bad_test = test.copy(bad_test_df)

    check = TrainTestSamplesMix()
    result = check.run(test_dataset=bad_test, train_dataset=train)
    result


**Train Test Samples Mix**

Detect samples in the test data that appear also in training data.

**Additional Outputs**

21.28% (10 / 47) of test data samples appear in train data

=============  ============  =================  ================  =================  ================  ======
Train indices  Test indices  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
=============  ============  =================  ================  =================  ================  ======
2              41, 44, 45    4.40               2.90              1.40               0.20              0
1              39, 40        4.90               3.00              1.40               0.20              0
4              43            4.90               2.50              4.50               1.70              2
0              38            5.00               2.00              3.50               1.00              1
3              42            5.00               2.30              3.30               1.00              1
30             28            5.80               2.70              5.10               1.90              2
10             46            5.80               4.00              1.20               0.20              0
=============  ============  =================  ================  =================  ================  ======
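
The ratio reported above can be approximated outside deepchecks. The following is a minimal plain-Python sketch, not the library's implementation: ``samples_mix_ratio`` is a hypothetical helper, rows are assumed to be hashable tuples, and the toy rows are invented for illustration. Like the output above, it counts every test row that matches a train row, so a train row that appears three times in test is counted three times.

.. code-block:: python

    def samples_mix_ratio(train_rows, test_rows):
        """Return the fraction of test rows that also occur in the train rows.

        Hypothetical helper for illustration; not part of deepchecks.
        """
        if not test_rows:
            return 0.0
        train_set = set(train_rows)  # rows must be hashable, e.g. tuples
        shared = sum(1 for row in test_rows if row in train_set)
        return shared / len(test_rows)


    # Toy data, invented for illustration (not the iris rows shown above)
    train_rows = [(5.1, 3.5, 0), (4.9, 3.0, 0), (6.2, 2.9, 1)]
    test_rows = [(4.9, 3.0, 0), (5.9, 3.0, 2), (4.9, 3.0, 0), (6.7, 3.1, 1)]

    ratio = samples_mix_ratio(train_rows, test_rows)
    print(f"{ratio:.2%} of test rows appear in train")

Here two of the four toy test rows match a train row, so the sketch reports a ratio of one half.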


.. GENERATED FROM PYTHON SOURCE LINES 39-43

Define a condition
==================

We can define a condition that enforces that the ratio of test samples that also appear in train does not exceed a given threshold; the default is ``0.1``.

.. GENERATED FROM PYTHON SOURCE LINES 43-46

.. code-block:: default

    check = TrainTestSamplesMix().add_condition_duplicates_ratio_not_greater_than()
    result = check.run(test_dataset=bad_test, train_dataset=train)
    result.show(show_additional_outputs=False)
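
As a rough sketch of the comparison such a condition performs, assuming (from the method name alone) that it passes when the ratio is not greater than the threshold: ``duplicates_ratio_passes`` and ``max_ratio`` are hypothetical names and are not part of the deepchecks API.

.. code-block:: python

    def duplicates_ratio_passes(n_shared, n_test, max_ratio=0.1):
        """Hypothetical helper: True when the shared-sample ratio is at most max_ratio."""
        if n_test == 0:
            return True  # assumption: an empty test set has nothing to flag
        return n_shared / n_test <= max_ratio


    # Counts taken from the check output reported earlier: 10 of 47 test samples
    passed = duplicates_ratio_passes(10, 47)
    print(passed)

With 10 shared samples out of 47, the ratio is about 0.21, so under this assumption the default threshold of ``0.1`` is exceeded and the sketch returns ``False``.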


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.826 seconds)