Train Test Samples Mix#

This notebook provides an overview for using and understanding the Train Test Samples Mix check.

Why is samples mix unwanted?#

Samples mix occurs when the train and test datasets have some samples in common. We use the test dataset to evaluate model performance, so samples shared with the train dataset lead to biased metrics that do not represent the performance we would get in a real scenario. Therefore, we always want to avoid samples mix.
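To make the idea concrete, here is a minimal sketch in plain Python of what "samples in common" means. It is not how the check is implemented; the helper `find_shared_samples` and the toy rows are hypothetical, and a sample is assumed to count as shared only when every value in the row matches exactly.

```python
from collections import defaultdict


def find_shared_samples(train_rows, test_rows):
    """Map each row found in both sets to (train indices, test indices).

    Hypothetical helper for illustration only; rows are compared by
    exact equality of all their values.
    """
    # Index the train rows by their content
    train_index = defaultdict(list)
    for i, row in enumerate(train_rows):
        train_index[tuple(row)].append(i)

    # Collect every test row whose content also exists in train
    shared = {}
    for j, row in enumerate(test_rows):
        key = tuple(row)
        if key in train_index:
            _, test_idx = shared.setdefault(key, (train_index[key], []))
            test_idx.append(j)
    return shared


# Toy data (made up): the second train row appears twice in test
train_rows = [(5.1, 3.5, 0), (4.9, 3.0, 0), (6.2, 2.9, 1)]
test_rows = [(5.9, 3.0, 2), (4.9, 3.0, 0), (4.9, 3.0, 0)]

shared = find_shared_samples(train_rows, test_rows)
print(shared)  # {(4.9, 3.0, 0): ([1], [1, 2])}
```

Any model evaluated on this toy test set would be scored twice on a row it has already seen during training, which is exactly the bias described above.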

Run the check#

We will run the check on the iris dataset.

import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks.methodology import TrainTestSamplesMix
from deepchecks.tabular.datasets.classification import iris

# Create data with leakage from train to test: append nine train rows
# (some of them repeated) to the end of the test data.
# pd.concat replaces the deprecated DataFrame.append.
train, test = iris.load_data()
leaked_rows = train.data.iloc[[0, 1, 1, 2, 3, 4, 2, 2, 10]]
bad_test_df = pd.concat([test.data, leaked_rows], ignore_index=True)
bad_test = test.copy(bad_test_df)

check = TrainTestSamplesMix()
result = check.run(test_dataset=bad_test, train_dataset=train)
result

Out:

Train Test Samples Mix

Detect samples in the test data that appear also in training data.

Additional Outputs
21.28% (10 / 47) of test data samples appear in train data
                                           sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
Train indices: 2 Test indices: 41, 44, 45  4.40               2.90              1.40               0.20              0
Train indices: 1 Test indices: 39, 40      4.90               3.00              1.40               0.20              0
Train indices: 4 Test indices: 43          4.90               2.50              4.50               1.70              2
Train indices: 0 Test indices: 38          5.00               2.00              3.50               1.00              1
Train indices: 3 Test indices: 42          5.00               2.30              3.30               1.00              1
Train indices: 30 Test indices: 28         5.80               2.70              5.10               1.90              2
Train indices: 10 Test indices: 46         5.80               4.00              1.20               0.20              0


Define a condition#

We can add a condition that requires the ratio of test samples that also appear in train to be no greater than a given threshold. The default threshold is 0.1.

check = TrainTestSamplesMix().add_condition_duplicates_ratio_not_greater_than()
result = check.run(test_dataset=bad_test, train_dataset=train)
result.show(show_additional_outputs=False)
Train Test Samples Mix
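The arithmetic behind this condition can be sketched in plain Python. This is an illustration only: `duplicates_ratio` and `passes_condition` are hypothetical helpers, not part of the deepchecks API, and the "less than or equal" comparison is an assumption based on the "not greater than" wording of the condition name. The counts (10 of 47) and the default threshold (0.1) come from the output and text above.

```python
def duplicates_ratio(n_test_in_train, n_test):
    """Fraction of test samples that also appear in train (hypothetical helper)."""
    return n_test_in_train / n_test


def passes_condition(ratio, max_ratio=0.1):
    """Assumed rule: the condition passes when the ratio is not greater than max_ratio."""
    return ratio <= max_ratio


# Counts reported by the check above: 10 of 47 test samples appear in train
ratio = duplicates_ratio(10, 47)
print(f"{ratio:.2%}")           # 21.28%
print(passes_condition(ratio))  # False
```

Under these assumptions, the leaked test set built earlier would fail the condition at its default threshold, since 21.28% is well above 10%.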

