.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/data_integrity/plot_string_mismatch.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_data_integrity_plot_string_mismatch.py:

.. _plot_tabular_string_mismatch:

String Mismatch
***************

This notebook provides an overview of how to use and understand the
"String Mismatch" check.

**Structure:**

* `What is the purpose of the check? <#what-is-the-purpose-of-the-check>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

What is the purpose of the check?
=================================
The String Mismatch check works on a single dataset and looks for mismatches
in each string column of the data. Finding string mismatches helps identify
errors in the data. For example, if your data is aggregated from multiple
sources, the same value may appear with slight variations in formatting, such
as a leading uppercase letter. In that case, the model's ability to learn may
be impaired, because it sees categories that are supposed to be the same as
different categories.

How Is a String Mismatch Defined?
---------------------------------
To recognize string mismatches, we transform each string to its base form.
The base form is the string with only its alphanumeric characters, in
lowercase. For example, the base form of "Cat-9?!" is "cat9". If two different
strings have the same base form, they are considered variants of the same
value.

.. GENERATED FROM PYTHON SOURCE LINES 35-37

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 37-49

.. code-block:: default


    import pandas as pd

    from deepchecks.tabular import Dataset
    from deepchecks.tabular.checks import StringMismatch

    # A single string column containing several variants of "deep" and "foo"
    data = {'col1': ['Deep', 'deep', 'deep!!!', '$deeP$', 'earth', 'foo', 'bar', 'foo?']}
    df = pd.DataFrame(data=data)

    # Wrap the DataFrame in a Dataset and run the check on it
    dataset = Dataset(df, cat_features=['col1'])
    result = StringMismatch().run(dataset)
    result.show()
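The base-form rule described earlier (keep only alphanumeric characters, then lowercase) can be sketched in plain Python to show which values of ``col1`` should be grouped together. This is a minimal illustration, not the library's implementation: ``base_form`` and ``find_variants`` are hypothetical helper names, and ``str.isalnum`` is assumed to be an adequate stand-in for "alphanumeric".

```python
from collections import defaultdict


def base_form(value: str) -> str:
    """Hypothetical helper: keep only alphanumeric characters, in lowercase."""
    return "".join(ch for ch in value if ch.isalnum()).lower()


def find_variants(values):
    """Hypothetical helper: group distinct values by base form.

    Only base forms that have more than one distinct variant are returned.
    """
    groups = defaultdict(set)
    for value in values:
        groups[base_form(value)].add(value)
    return {
        base: sorted(variants)
        for base, variants in groups.items()
        if len(variants) > 1
    }


# Same sample column as in the example above
col1 = ['Deep', 'deep', 'deep!!!', '$deeP$', 'earth', 'foo', 'bar', 'foo?']
variants = find_variants(col1)
print(variants)
```

Under these assumptions, the values sharing the base forms ``deep`` and ``foo`` are reported as variants, while ``earth`` and ``bar`` are not, since each has only one spelling.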

.. GENERATED FROM PYTHON SOURCE LINES 50-52

Define a Condition
==================

.. GENERATED FROM PYTHON SOURCE LINES 52-56

.. code-block:: default


    # Add a condition that requires string columns to contain no variants
    check = StringMismatch().add_condition_no_variants()
    result = check.run(dataset)
    # Show only the condition results, without the additional outputs
    result.show(show_additional_outputs=False)
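Conceptually, a "no variants" condition passes only when no two distinct values in a column share a base form. The following plain-Python sketch illustrates that idea under the base-form definition given earlier; ``base_form`` and ``has_no_variants`` are hypothetical names, and the actual condition logic in the library may differ.

```python
def base_form(value: str) -> str:
    """Hypothetical helper: keep only alphanumeric characters, in lowercase."""
    return "".join(ch for ch in value if ch.isalnum()).lower()


def has_no_variants(values) -> bool:
    """Hypothetical helper: True when no two distinct values share a base form."""
    distinct = set(values)
    # If every distinct value has its own base form, the two counts match.
    return len({base_form(v) for v in distinct}) == len(distinct)


# The sample column contains variants of "deep" and "foo"
with_variants = ['Deep', 'deep', 'deep!!!', '$deeP$', 'earth', 'foo', 'bar', 'foo?']
# Exact repeats of one spelling are not variants
cleaned = ['deep', 'deep', 'earth', 'foo', 'bar']

print(has_no_variants(with_variants))
print(has_no_variants(cleaned))
```

Exact duplicates are collapsed before comparison, so repeated occurrences of a single spelling do not count as a mismatch in this sketch.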
