.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/train_test_validation/plot_train_test_sample_mix.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_nlp_auto_checks_train_test_validation_plot_train_test_sample_mix.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_train_test_validation_plot_train_test_sample_mix.py:


.. _nlp__train_test_samples_mix:

Train-Test Samples Mix
************************

This notebook provides an overview of how to use and interpret the train-test samples mix check.

**Structure:**

* `Why check for train-test samples mix? <#why-check-for-train-test-samples-mix>`__
* `Create TextData for Train and Test Sets <#create-textdata-for-train-and-test-sets>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

Why check for train-test samples mix?
======================================

The ``TrainTestSamplesMix`` check finds samples that are identical or nearly identical
(see `text normalization <#with-text-normalization>`__) across the train and test datasets.
When such samples are present unintentionally, they can cause data leakage: the model is
evaluated on text it has already seen during training, so the measured performance is overly
optimistic. Identifying and addressing these overlaps is crucial to ensure the model performs
well on unseen data.

Create TextData for Train and Test Sets
========================================

Let's create train and test datasets with some overlapping and similar text samples.

.. GENERATED FROM PYTHON SOURCE LINES 30-51

.. code-block:: default


    from deepchecks.nlp.checks import TrainTestSamplesMix
    from deepchecks.nlp import TextData

    train_texts = [
        "Deep learning is a subset of machine learning.",
        "Deep learning is a subset of machine learning.",
        "Deep learning is a sub-set of Machine Learning.",
        "Natural language processing is a subfield of AI.",
    ]

    test_texts = [
        "Deep learning is a subset of machine learning.",
        "Deep learning is subset of machine learning",
        "Machine learning is a subfield of AI.",
        "This is a unique text sample in the test set.",
        "This is another unique text in the test set.",
    ]

    train_dataset = TextData(train_texts)
    test_dataset = TextData(test_texts)

.. GENERATED FROM PYTHON SOURCE LINES 52-54

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 54-66

.. code-block:: default


    # Run the check without any text normalization
    check = TrainTestSamplesMix(
        ignore_case=False,
        remove_punctuation=False,
        normalize_unicode=False,
        remove_stopwords=False,
        ignore_whitespace=False
    )
    result = check.run(train_dataset, test_dataset)
    result.show()


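To build intuition for what the check reports when normalization is switched off, the following is a minimal plain-Python sketch of exact-match detection. It is not the deepchecks implementation; the helper name ``exact_duplicates_ratio`` is hypothetical, and it assumes the ratio is the share of test samples that also occur in the train set.

```python
# Same sample texts as in the example above.
train_texts = [
    "Deep learning is a subset of machine learning.",
    "Deep learning is a subset of machine learning.",
    "Deep learning is a sub-set of Machine Learning.",
    "Natural language processing is a subfield of AI.",
]
test_texts = [
    "Deep learning is a subset of machine learning.",
    "Deep learning is subset of machine learning",
    "Machine learning is a subfield of AI.",
    "This is a unique text sample in the test set.",
    "This is another unique text in the test set.",
]


def exact_duplicates_ratio(train, test):
    """Return (ratio, leaked): the share of test texts found verbatim in train.

    Hypothetical helper for illustration only; assumes the ratio is
    computed relative to the number of test samples.
    """
    if not test:
        return 0.0, []
    train_set = set(train)  # set lookup keeps membership tests cheap
    leaked = [text for text in test if text in train_set]
    return len(leaked) / len(test), leaked


ratio, leaked = exact_duplicates_ratio(train_texts, test_texts)
print(f"{len(leaked)} of {len(test_texts)} test samples also appear in train")
```

Under exact matching, only the first test sample counts as a duplicate; the near-identical variants (missing article, different casing or punctuation) go undetected.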
.. GENERATED FROM PYTHON SOURCE LINES 67-73

With Text Normalization
-----------------------

By default, the ``TrainTestSamplesMix`` check applies text normalization before identifying
duplicates. This includes case normalization, punctuation removal, Unicode normalization, and
stopword removal. You can also customize the normalization to fit your requirements:

.. GENERATED FROM PYTHON SOURCE LINES 73-84

.. code-block:: default


    check = TrainTestSamplesMix(
        ignore_case=True,
        remove_punctuation=True,
        normalize_unicode=True,
        remove_stopwords=True,
        ignore_whitespace=True
    )
    result = check.run(train_dataset, test_dataset)
    result.show()


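The effect of normalization can be illustrated with a rough standard-library sketch. This is an assumption-laden approximation, not how deepchecks normalizes text internally: the ``normalize`` helper, the tiny ``STOPWORDS`` set, and the choice of NFKC are all hypothetical stand-ins.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "in"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Map a text to a comparison key (illustrative sketch only)."""
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = text.lower()                         # ignore case
    text = text.translate(_PUNCT_TABLE)         # remove punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)                     # also collapses whitespace


train_variant = "Deep learning is a sub-set of Machine Learning."
test_variant = "Deep learning is subset of machine learning"

print(normalize(train_variant))
print(normalize(test_variant))
```

Both variants reduce to the same key, which is why samples that differ only in casing, punctuation, or stopwords can be reported as duplicates once normalization is enabled.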
.. GENERATED FROM PYTHON SOURCE LINES 85-93

Of the normalization parameters passed to ``TrainTestSamplesMix`` above, ``ignore_whitespace``
is the only one that defaults to ``False``.

Define a Condition
==================

Now, we define a condition that requires the ratio of duplicates to be 0. A condition is
deepchecks' way of validating model and data quality: it lets you know when something goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 93-98

.. code-block:: default


    check = TrainTestSamplesMix()
    check.add_condition_duplicates_ratio_less_or_equal(0)
    result = check.run(train_dataset, test_dataset)
    result.show(show_additional_outputs=False)


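Conceptually, a condition of this kind is a threshold test on the value the check computes. The sketch below shows that idea in plain Python; ``ConditionOutcome`` and ``duplicates_ratio_less_or_equal`` are hypothetical names chosen for illustration and do not mirror deepchecks' internal classes.

```python
from dataclasses import dataclass


@dataclass
class ConditionOutcome:
    """Hypothetical container for a pass/fail verdict and a short message."""
    passed: bool
    details: str


def duplicates_ratio_less_or_equal(ratio, max_ratio=0.0):
    """Pass only if the duplicates ratio does not exceed max_ratio."""
    passed = ratio <= max_ratio
    details = (
        f"duplicates ratio is {ratio:.0%}, "
        f"allowed maximum is {max_ratio:.0%}"
    )
    return ConditionOutcome(passed=passed, details=details)


# A ratio of 0.4 violates a zero-tolerance threshold.
outcome = duplicates_ratio_less_or_equal(0.4, max_ratio=0.0)
print(outcome.passed, "-", outcome.details)
```

A threshold of 0 is the strictest setting: a single test sample that also occurs in the train set is enough to fail the condition.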
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.072 seconds)


.. _sphx_glr_download_nlp_auto_checks_train_test_validation_plot_train_test_sample_mix.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_train_test_sample_mix.py <plot_train_test_sample_mix.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_train_test_sample_mix.ipynb <plot_train_test_sample_mix.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_