.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/data_integrity/plot_text_duplicates.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_data_integrity_plot_text_duplicates.py:

.. _nlp__data_duplicates:

Text Data Duplicates
********************

This notebook provides an overview of how to use and interpret the text data duplicates check.

**Structure:**

* `Why check for text data duplicates? <#why-check-for-text-data-duplicates>`__
* `Create TextData <#create-textdata>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

Why check for text data duplicates?
===================================

The ``TextDuplicates`` check finds multiple instances of identical or nearly identical (see
`text normalization <#with-text-normalization>`__) samples in the Dataset. Duplicate samples increase the
weight the model gives to those samples. If the duplicates are intentional (e.g. the result of
deliberate oversampling, or because the dataset naturally contains identical-looking samples), this may be
valid. If they are unexpected, however, they may indicate a problem in the data pipeline that requires
attention.

Create TextData
===============

Let's create a simple dataset with some duplicate and similar text samples.

.. GENERATED FROM PYTHON SOURCE LINES 33-49

.. code-block:: default


    from deepchecks.nlp.checks import TextDuplicates
    from deepchecks.nlp import TextData

    texts = [
        "Deep learning is a subset of machine learning.",
        "Deep learning is a subset of machine learning.",  # exact duplicate of the first sample
        "Deep learning is a sub-set of Machine Learning.",  # differs in hyphenation and case
        "Deep learning is subset of machine learning",  # missing "a" and the final period
        "Natural language processing is a subfield of AI.",
        "This is a unique text sample.",
        "This is another unique text.",
    ]

    # Wrap the raw strings in a TextData object so the check can run on them
    dataset = TextData(texts)


.. GENERATED FROM PYTHON SOURCE LINES 50-52

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 52-62

.. code-block:: default


    # Run the check without any text normalization
    TextDuplicates(
        ignore_case=False,
        remove_punctuation=False,
        normalize_unicode=False,
        remove_stopwords=False,
        ignore_whitespace=False
    ).run(dataset)

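To make the notion of an exact duplicate concrete, here is a minimal sketch in plain Python that does not use deepchecks. It groups sample indices by their raw string, which is roughly what duplicate detection without normalization amounts to. The helper name ``find_exact_duplicates`` is hypothetical, and this is an illustration rather than the library's implementation.

```python
from collections import defaultdict

texts = [
    "Deep learning is a subset of machine learning.",
    "Deep learning is a subset of machine learning.",
    "Deep learning is a sub-set of Machine Learning.",
    "Deep learning is subset of machine learning",
    "Natural language processing is a subfield of AI.",
    "This is a unique text sample.",
    "This is another unique text.",
]


def find_exact_duplicates(samples):
    """Return groups of indices whose texts are character-for-character identical.

    Hypothetical helper for illustration only, not part of deepchecks.
    """
    groups = defaultdict(list)
    for index, text in enumerate(samples):
        groups[text].append(index)
    # Keep only the groups that contain more than one sample
    return [indices for indices in groups.values() if len(indices) > 1]


exact_groups = find_exact_duplicates(texts)
print(exact_groups)  # [[0, 1]]
```

Only the first two samples match in this sketch. The third and fourth samples differ in hyphenation, case, a stopword, or punctuation, so a raw string comparison treats them as distinct.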
.. GENERATED FROM PYTHON SOURCE LINES 63-68

With Text Normalization
-----------------------

By default, the ``TextDuplicates`` check applies text normalization before identifying duplicates.
This includes case normalization, punctuation removal, Unicode normalization and stopword removal.
You can also customize the normalization to suit your requirements:

.. GENERATED FROM PYTHON SOURCE LINES 68-77

.. code-block:: default


    # Enable every normalization step, including whitespace removal
    TextDuplicates(
        ignore_case=True,
        remove_punctuation=True,
        normalize_unicode=True,
        remove_stopwords=True,
        ignore_whitespace=True
    ).run(dataset)

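The effect of normalization on duplicate detection can be sketched with the standard library alone. The ``normalize`` function and the tiny ``STOPWORDS`` set below are assumptions made for illustration. The steps deepchecks actually applies, and its stopword list, may differ, so the real check may group samples differently.

```python
import string
import unicodedata
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "of"}

# Translation table that deletes every ASCII punctuation character
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Illustrative normalization: Unicode NFKC, lowercasing, punctuation
    removal and stopword removal. Splitting and re-joining the words also
    collapses runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = text.translate(PUNCTUATION_TABLE)
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)


def find_near_duplicates(samples):
    """Group sample indices by their normalized form (hypothetical helper)."""
    groups = defaultdict(list)
    for index, text in enumerate(samples):
        groups[normalize(text)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


texts = [
    "Deep learning is a subset of machine learning.",
    "Deep learning is a subset of machine learning.",
    "Deep learning is a sub-set of Machine Learning.",
    "Deep learning is subset of machine learning",
    "Natural language processing is a subfield of AI.",
    "This is a unique text sample.",
    "This is another unique text.",
]

near_groups = find_near_duplicates(texts)
print(near_groups)  # [[0, 1, 2, 3]]
```

Under these assumptions the first four samples all reduce to ``deep learning subset machine learning``, so they fall into a single group even though only two of them are exact duplicates.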
.. GENERATED FROM PYTHON SOURCE LINES 78-86

Of all the parameters passed to ``TextDuplicates`` in this example, ``ignore_whitespace`` is the only one
set to ``False`` by default.

Define a Condition
==================

Now we define a condition that requires the ratio of duplicates to be 0. A condition is
deepchecks' way of validating model and data quality and letting you know if anything goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 86-91

.. code-block:: default


    check = TextDuplicates()
    # Fail the condition if any duplicates are found (ratio must be <= 0)
    check.add_condition_ratio_less_or_equal(0)
    result = check.run(dataset)
    # Show only the condition result, without the additional outputs
    result.show(show_additional_outputs=False)

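As a rough illustration of what a ratio condition evaluates, the sketch below computes a duplicate ratio in plain Python and compares it with a threshold. It assumes the ratio is defined as ``1 - unique / total`` over the raw strings. That definition and the helper names ``duplicate_ratio`` and ``ratio_less_or_equal`` are assumptions for illustration, not the documented behavior of deepchecks.

```python
def duplicate_ratio(samples):
    """Share of samples that repeat another sample: 1 - unique / total.

    Assumed definition for illustration; the library may compute it differently.
    """
    if not samples:
        return 0.0
    return 1 - len(set(samples)) / len(samples)


def ratio_less_or_equal(samples, max_ratio):
    """Return True when the duplicate ratio does not exceed max_ratio."""
    return duplicate_ratio(samples) <= max_ratio


texts = [
    "Deep learning is a subset of machine learning.",
    "Deep learning is a subset of machine learning.",
    "Deep learning is a sub-set of Machine Learning.",
    "Deep learning is subset of machine learning",
    "Natural language processing is a subfield of AI.",
    "This is a unique text sample.",
    "This is another unique text.",
]

ratio = duplicate_ratio(texts)
passed = ratio_less_or_equal(texts, 0)
print(f"{ratio:.3f}", passed)  # 0.143 False
```

With a threshold of 0, a single repeated sample is enough to fail the condition. This sketch compares raw strings only, so a check that normalizes text first would be expected to report a higher ratio on the same data.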
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.541 seconds)