.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/integrity/plot_string_mismatch_comparison.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_checks_gallery_tabular_integrity_plot_string_mismatch_comparison.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_integrity_plot_string_mismatch_comparison.py:

String Mismatch Comparison
**************************

This page provides an overview for using and understanding the "String Mismatch Comparison" check.

**Structure:**

* `What is the purpose of the check? <#what-is-the-purpose-of-the-check>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

What is the purpose of the check?
=================================

The check compares the same categorical column in the train and test datasets and checks
whether there are variants of similar strings that exist only in test and not in train.
Finding these mismatches helps prevent errors when running inference on the test data.
For example, the train data may contain the category 'New York' while the test data
contains 'new york'. We would like to be notified that the test data contains a new
variant of a train category, so we can address the problem.

How is a String Mismatch Defined?
---------------------------------

To recognize a string mismatch, we transform each string to its base form. The base form
is the string with only its alphanumeric characters, in lowercase. For example, the base
form of "Cat-9?!" is "cat9". If two strings have the same base form, they are considered
to be the same.

.. GENERATED FROM PYTHON SOURCE LINES 29-32

.. code-block:: default

    import pandas as pd

.. GENERATED FROM PYTHON SOURCE LINES 33-35

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 35-44

.. code-block:: default

    from deepchecks.tabular.checks import StringMismatchComparison

    data = {'col1': ['Deep', 'deep', 'deep!!!', 'earth', 'foo', 'bar', 'foo?']}
    compared_data = {'col1': ['Deep', 'deep', '$deeP$', 'earth', 'foo', 'bar', 'foo?', '?deep']}

    check = StringMismatchComparison()
    result = check.run(pd.DataFrame(data=data), pd.DataFrame(data=compared_data))
    result

**String Mismatch Comparison**

Detect different variants of string categories between the same categorical column in two datasets.

**Additional Outputs**

Showing only the top 10 columns; this can be changed with the ``n_top_columns`` parameter.

.. list-table::
    :header-rows: 1

    * - Column name
      - Base form
      - Common variants
      - Variants only in test
      - % Unique variants out of all dataset samples (count)
      - Variants only in train
      - % Unique variants out of all baseline samples (count)
    * - col1
      - deep
      - ``['Deep', 'deep']``
      - ``['?deep', '$deeP$']``
      - 25% (2)
      - ``['deep!!!']``
      - 14.29% (1)
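The grouping in the output above can be reproduced with a small sketch in plain Python. This is not the deepchecks implementation: ``base_form``, ``group_by_base_form`` and ``compare_variants`` are hypothetical helper names, and the sketch assumes the base form is exactly "keep alphanumeric characters, then lowercase" and that only base forms present in both datasets are compared.

```python
from collections import defaultdict


def base_form(value: str) -> str:
    """Keep only alphanumeric characters and lowercase them (hypothetical helper)."""
    return "".join(ch for ch in value if ch.isalnum()).lower()


def group_by_base_form(values):
    """Map each base form to the set of distinct variants that share it."""
    groups = defaultdict(set)
    for value in values:
        groups[base_form(value)].add(value)
    return groups


def compare_variants(train_values, test_values):
    """Report, per shared base form, which variants are common or one-sided.

    Base forms whose variants are identical in train and test are skipped
    (an assumption of this sketch).
    """
    train_groups = group_by_base_form(train_values)
    test_groups = group_by_base_form(test_values)
    report = {}
    for form in sorted(set(train_groups) & set(test_groups)):
        train_variants, test_variants = train_groups[form], test_groups[form]
        if train_variants == test_variants:
            continue
        report[form] = {
            "common": sorted(train_variants & test_variants),
            "only_in_test": sorted(test_variants - train_variants),
            "only_in_train": sorted(train_variants - test_variants),
        }
    return report


train = ['Deep', 'deep', 'deep!!!', 'earth', 'foo', 'bar', 'foo?']
test = ['Deep', 'deep', '$deeP$', 'earth', 'foo', 'bar', 'foo?', '?deep']
report = compare_variants(train, test)
print(report)
```

Only the ``deep`` base form is reported: ``foo`` and ``foo?`` appear in both datasets, so they produce no one-sided variants. The percentages in the table follow from the counts: 2 of the 8 test samples (25%) and 1 of the 7 train samples (14.29%).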


.. GENERATED FROM PYTHON SOURCE LINES 45-47

Define a Condition
==================

.. GENERATED FROM PYTHON SOURCE LINES 47-51

.. code-block:: default

    check = StringMismatchComparison().add_condition_no_new_variants()
    result = check.run(pd.DataFrame(data=data), pd.DataFrame(data=compared_data))
    result.show(show_additional_outputs=False)
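As its name suggests, the ``add_condition_no_new_variants`` condition is expected to fail when the test data contains a variant that train does not have for a base form both datasets share. A minimal sketch of that rule follows, assuming the alphanumeric, lowercase base form described earlier. ``no_new_variants`` is a hypothetical helper written for illustration and is not the deepchecks implementation.

```python
def base_form(value: str) -> str:
    """Keep only alphanumeric characters and lowercase them (hypothetical helper)."""
    return "".join(ch for ch in value if ch.isalnum()).lower()


def no_new_variants(train_values, test_values):
    """Return (passed, new_variants).

    new_variants maps a base form to the sorted test-only variants of a
    category that train already contains under a different spelling.
    """
    train_variants = set(train_values)
    train_forms = {base_form(value) for value in train_variants}
    new_variants = {}
    for value in set(test_values):
        form = base_form(value)
        if form in train_forms and value not in train_variants:
            new_variants.setdefault(form, []).append(value)
    new_variants = {form: sorted(vals) for form, vals in new_variants.items()}
    return (not new_variants, new_variants)


train = ['Deep', 'deep', 'deep!!!', 'earth', 'foo', 'bar', 'foo?']
test = ['Deep', 'deep', '$deeP$', 'earth', 'foo', 'bar', 'foo?', '?deep']
passed, new_variants = no_new_variants(train, test)
print(passed, new_variants)
```

With the example data the sketch reports a failure, because ``'$deeP$'`` and ``'?deep'`` are new spellings of the train category ``deep``. Variants that exist only in train, such as ``'deep!!!'``, do not affect the result.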


.. rst-class:: sphx-glr-timing

    **Total running time of the script:** ( 0 minutes 0.059 seconds)

.. _sphx_glr_download_checks_gallery_tabular_integrity_plot_string_mismatch_comparison.py:

.. only:: html

    .. container:: sphx-glr-footer
        :class: sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_string_mismatch_comparison.py <plot_string_mismatch_comparison.py>`

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_string_mismatch_comparison.ipynb <plot_string_mismatch_comparison.ipynb>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_