Identifier Label Correlation#

This notebook provides an overview for using and understanding the identifier-label correlation check.

This check computes the Predictive Power Score (PPS) meaning, the ability of a unique identifier (index or datetime) column to predict the label.

High predictive score could indicate a problem in the data collection pipeline, and even though the identifier column doesn’t directly enter the model, collecting the data differently for different labels could have an indirect influence on the data.

Structure:

Generate Data
Run the Check
Define a Condition

Imports#

import numpy as np
import pandas as pd

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import IdentifierLabelCorrelation

Generate Data#

np.random.seed(42)
df = pd.DataFrame(np.random.randn(100, 3), columns=['x1', 'x2', 'x3'])
df['x4'] = df['x1'] * 0.05 + df['x2']
df['x5'] = df['x2']*121 + 0.01 * df['x1']
df['label'] = df['x5'].apply(lambda x: 0 if x < 0 else 1)

dataset = Dataset(df, label='label', index_name='x1', datetime_name='x2', cat_features=[])

Run The Check#

check = IdentifierLabelCorrelation()
check.run(dataset)

# To display the results in an IDE like PyCharm, you can use the following code:
# check.run(ds).show()
# The result will be displayed in a new window.

Identifier Label Correlation

Define a Condition#

Now we will define a condition that the PPS should be less than or equal to 0.2.

result = check.add_condition_pps_less_or_equal(max_pps=0.2).run(dataset)
result.show(show_additional_outputs=False)

Identifier Label Correlation

Total running time of the script: (0 minutes 0.247 seconds)

Gallery generated by Sphinx-Gallery

Outlier Sample Detection

Conflicting Labels