.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/data_integrity/plot_identifier_label_correlation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_data_integrity_plot_identifier_label_correlation.py:


.. _plot_tabular_identifier_label_correlation:

Identifier Label Correlation
****************************

This notebook provides an overview of how to use and understand the
identifier-label correlation check.

This check computes the Predictive Power Score (:ref:`PPS `), that is, the
ability of a unique identifier (index or datetime) column to predict the label.

A high predictive score could indicate a problem in the data collection
pipeline: even though the identifier column doesn't directly enter the model,
collecting the data differently for different labels could indirectly
influence the data.

**Structure:**

* `Generate Data <#generate-data>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

.. GENERATED FROM PYTHON SOURCE LINES 26-28

Imports
=======

.. GENERATED FROM PYTHON SOURCE LINES 28-35

.. code-block:: default


    import numpy as np
    import pandas as pd

    from deepchecks.tabular import Dataset
    from deepchecks.tabular.checks import IdentifierLabelCorrelation


.. GENERATED FROM PYTHON SOURCE LINES 36-38

Generate Data
=============

.. GENERATED FROM PYTHON SOURCE LINES 38-45

.. code-block:: default


    np.random.seed(42)
    df = pd.DataFrame(np.random.randn(100, 3), columns=['x1', 'x2', 'x3'])
    df['x4'] = df['x1'] * 0.05 + df['x2']
    df['x5'] = df['x2'] * 121 + 0.01 * df['x1']
    # The label is the sign of x5, so it is almost entirely determined by x2.
    df['label'] = df['x5'].apply(lambda x: 0 if x < 0 else 1)


.. GENERATED FROM PYTHON SOURCE LINES 46-49

.. code-block:: default


    # Declare x1 as the index and x2 as the datetime column, so that both are
    # treated as identifiers by the check.
    dataset = Dataset(df, label='label', index_name='x1', datetime_name='x2', cat_features=[])


.. GENERATED FROM PYTHON SOURCE LINES 50-52

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 52-60

.. code-block:: default


    check = IdentifierLabelCorrelation()
    check.run(dataset)

    # To display the results in an IDE like PyCharm, you can use the following code:
    # check.run(dataset).show()
    # The result will be displayed in a new window.


.. raw:: html

    <!-- Rendered check output: Identifier Label Correlation -->
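
The check above scores how well an identifier column predicts the label. As a
rough illustration of that idea only, and not of how deepchecks computes the
PPS, the sketch below fits a one-feature threshold rule on half of the data and
rescales its held-out accuracy against a majority-class baseline. The helper
names, the use of plain accuracy, and the half-and-half split are all
assumptions made for this example; it uses only the standard library.

.. code-block:: default

    import random


    def majority_accuracy(labels):
        """Accuracy of always predicting the most frequent label."""
        return max(labels.count(0), labels.count(1)) / len(labels)


    def fit_stump(values, labels):
        """Pick the threshold and orientation with the best training accuracy."""
        best = (0.0, 0.0, 1)  # (accuracy, threshold, label predicted at or above it)
        for threshold in sorted(set(values)):
            for high_label in (0, 1):
                hits = sum(
                    (high_label if v >= threshold else 1 - high_label) == y
                    for v, y in zip(values, labels)
                )
                accuracy = hits / len(labels)
                if accuracy > best[0]:
                    best = (accuracy, threshold, high_label)
        return best[1], best[2]


    def rough_predictive_score(values, labels):
        """Hypothetical helper: held-out stump accuracy rescaled against the baseline."""
        half = len(values) // 2
        threshold, high_label = fit_stump(values[:half], labels[:half])
        test_values, test_labels = values[half:], labels[half:]
        hits = sum(
            (high_label if v >= threshold else 1 - high_label) == y
            for v, y in zip(test_values, test_labels)
        )
        accuracy = hits / len(test_labels)
        baseline = majority_accuracy(test_labels)
        if baseline >= 1.0:
            return 0.0  # only one class in the held-out half, nothing to gain
        # 0 means "no better than the baseline", 1 means "perfect prediction".
        return max(0.0, (accuracy - baseline) / (1.0 - baseline))


    rng = random.Random(42)
    leaky_id = [rng.gauss(0, 1) for _ in range(200)]     # determines the label
    harmless_id = [rng.gauss(0, 1) for _ in range(200)]  # unrelated to the label
    label = [0 if v < 0 else 1 for v in leaky_id]

    leaky_score = rough_predictive_score(leaky_id, label)
    harmless_score = rough_predictive_score(harmless_id, label)
    print(f"leaky identifier: {leaky_score:.2f}, harmless identifier: {harmless_score:.2f}")

An identifier that leaks the label should score close to 1 under this rough
measure, while an unrelated identifier should stay near 0, which is the pattern
the check is designed to surface.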


.. GENERATED FROM PYTHON SOURCE LINES 61-64

Define a Condition
==================
Now we define a condition that requires the PPS of every identifier column to
be less than or equal to 0.2.

.. GENERATED FROM PYTHON SOURCE LINES 64-66

.. code-block:: default


    result = check.add_condition_pps_less_or_equal(max_pps=0.2).run(dataset)
    result.show(show_additional_outputs=False)


.. raw:: html

    <!-- Rendered check output: Identifier Label Correlation -->
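
Conceptually, a "less than or equal" condition compares each identifier
column's score with ``max_pps`` and fails if any column exceeds it. The sketch
below shows only that comparison; ``columns_over_threshold`` is a hypothetical
helper and the scores are illustrative values, not output of the check.

.. code-block:: default

    def columns_over_threshold(pps_by_column, max_pps=0.2):
        """Hypothetical helper: return the columns whose score exceeds max_pps."""
        return {
            column: score
            for column, score in pps_by_column.items()
            if score > max_pps
        }


    # Illustrative scores only; these are not results produced by deepchecks.
    example_scores = {'x1': 0.0, 'x2': 0.42}

    failing = columns_over_threshold(example_scores, max_pps=0.2)
    passed = not failing
    print(f"condition passed: {passed}, columns over threshold: {sorted(failing)}")

A score exactly equal to ``max_pps`` passes in this sketch, matching the "less
than or equal" wording of the condition.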


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.203 seconds)


.. _sphx_glr_download_checks_gallery_tabular_data_integrity_plot_identifier_label_correlation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_identifier_label_correlation.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_identifier_label_correlation.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_