
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/data_integrity/plot_frequent_substrings.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_data_integrity_plot_frequent_substrings.py:

.. _nlp__frequent_substrings:

Frequent Substrings
********************

This notebook provides an overview of how to use and interpret the frequent substrings check:

**Structure:**

* `Why check for frequent substrings? <#why-check-for-frequent-substrings>`__
* `Create TextData <#create-textdata>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

Why check for frequent substrings?
===================================

The ``FrequentSubstrings`` check identifies substrings that recur across the text samples of a dataset.
Such substrings can signal issues in the data pipeline that need attention. They can also affect
the model's performance, and in some scenarios it may be necessary to remove them from the dataset.

The check works as follows. Substrings of varying lengths (n-grams) are extracted from the text samples
and their frequencies are calculated. Only substrings that exceed a defined minimum length are retained.
The retained substrings are sorted by frequency, and the check displays the most frequent substrings
together with those that surpass a significance level.

Create TextData
===============

Let's create a simple dataset with some frequent substrings.

.. GENERATED FROM PYTHON SOURCE LINES 35-50

.. code-block:: default

    from deepchecks.nlp.checks import FrequentSubstrings
    from deepchecks.nlp import TextData

    # Three of the six samples end with the same trailing signature.
    texts = [
        'Deep learning is a subset of machine learning. Sent from my iPhone',
        'Deep learning is a sub-set of Machine Learning.',
        'Natural language processing is a subfield of AI. Sent from my iPhone',
        'NLP is a subfield of Artificial Intelligence. Sent from my iPhone',
        'This is a unique text sample.',
        'This is another unique text.'
    ]

    dataset = TextData(texts)

.. GENERATED FROM PYTHON SOURCE LINES 51-53

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 53-56

.. code-block:: default

    FrequentSubstrings().run(dataset)
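
The procedure described under "Why check for frequent substrings?" can be illustrated with a small,
library-free sketch. This is not the deepchecks implementation: the helper names ``word_ngrams`` and
``frequent_ngrams``, the choice to measure length in words, and the ``min_len`` and ``min_ratio``
defaults are all assumptions made for illustration.

.. code-block:: python

    from collections import Counter


    def word_ngrams(text, min_len):
        """Yield every run of at least `min_len` consecutive words in `text`."""
        words = text.split()
        for start in range(len(words)):
            for end in range(start + min_len, len(words) + 1):
                yield " ".join(words[start:end])


    def frequent_ngrams(texts, min_len=4, min_ratio=0.3):
        """Return (ngram, ratio) pairs whose sample ratio is at least `min_ratio`.

        Hypothetical helper: `ratio` is the share of samples containing the n-gram.
        """
        counts = Counter()
        for text in texts:
            # Using a set counts each n-gram at most once per sample.
            counts.update(set(word_ngrams(text, min_len)))
        kept = [
            (ngram, count / len(texts))
            for ngram, count in counts.items()
            if count / len(texts) >= min_ratio
        ]
        # Most frequent first; ties are broken alphabetically for a stable order.
        return sorted(kept, key=lambda item: (-item[1], item[0]))


    texts = [
        'Deep learning is a subset of machine learning. Sent from my iPhone',
        'Deep learning is a sub-set of Machine Learning.',
        'Natural language processing is a subfield of AI. Sent from my iPhone',
        'NLP is a subfield of Artificial Intelligence. Sent from my iPhone',
        'This is a unique text sample.',
        'This is another unique text.'
    ]

    top = frequent_ngrams(texts)
    for ngram, ratio in top:
        print(f"{ratio:.2f}  {ngram}")

In this sketch the signature ``Sent from my iPhone`` ranks first because it appears in three of the
six samples. The actual check may tokenize, filter, and rank substrings differently.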


.. GENERATED FROM PYTHON SOURCE LINES 57-63

Define a Condition
==================

Now we define a condition that enforces that the ratio of every frequent substring
in the data is smaller than 0.05. The code below adds it with ``add_condition_zero_result()``.
A condition is deepchecks' way to validate model and data quality, and it lets you know
if anything goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 63-68

.. code-block:: default

    check = FrequentSubstrings()
    # Attach the condition before running so the result carries a pass/fail status.
    check.add_condition_zero_result()
    result = check.run(dataset)
    # Show only the condition outcome, without the additional display output.
    result.show(show_additional_outputs=False)
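
To make the idea of a condition concrete, here is a minimal, library-free sketch of a pass/fail rule
applied to substring ratios. It does not reproduce how deepchecks evaluates conditions: the function
``ratio_condition``, its ``max_ratio`` parameter, and the example ratios are hypothetical.

.. code-block:: python

    def ratio_condition(substring_ratios, max_ratio=0.05):
        """Pass only if every substring ratio is smaller than `max_ratio`.

        Hypothetical helper: returns (passed, offenders), where offenders maps
        each substring that breaks the rule to its ratio.
        """
        offenders = {
            substring: ratio
            for substring, ratio in substring_ratios.items()
            if ratio >= max_ratio
        }
        return (not offenders), offenders


    # Illustrative ratios only, not output produced by deepchecks.
    example_ratios = {"Sent from my iPhone": 0.5, "is a subfield of": 0.02}

    passed, offenders = ratio_condition(example_ratios)
    print("passed" if passed else f"failed: {sorted(offenders)}")

A rule of this shape fails as soon as one substring reaches the threshold, which is why a single
recurring signature is enough to flag a dataset.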

