Conflicting Labels#

This notebook provides an overview of how to use and understand the conflicting labels check.

Structure:

- What are Conflicting Labels?
- Load Data
- Run the Check
- Define a Condition

What are Conflicting Labels?#

The check searches for identical samples with different labels. This can occur either because the data is mislabeled or because the collected data is missing features necessary to separate the labels. Mislabeled data can confuse the model and lower its performance.
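The idea can be illustrated without deepchecks. The following is a minimal sketch, not the library's implementation: it groups rows by their feature values and reports every group that was seen with more than one label. The helper `find_conflicting_labels` and the toy rows are hypothetical and are not taken from the phishing dataset.

```python
from collections import defaultdict


def find_conflicting_labels(rows, feature_names, label_name):
    """Map each feature-value tuple seen with more than one label to its sorted labels."""
    labels_by_features = defaultdict(set)
    for row in rows:
        # Identical samples share the same tuple of feature values.
        key = tuple(row[name] for name in feature_names)
        labels_by_features[key].add(row[label_name])
    # Keep only the groups that carry at least two distinct labels.
    return {
        key: tuple(sorted(labels))
        for key, labels in labels_by_features.items()
        if len(labels) > 1
    }


# Hypothetical toy data for illustration only.
toy_rows = [
    {"urlLength": 81, "numDigits": 6, "target": 0},
    {"urlLength": 81, "numDigits": 6, "target": 1},  # same features, different label
    {"urlLength": 90, "numDigits": 2, "target": 0},
    {"urlLength": 90, "numDigits": 2, "target": 0},  # exact duplicate, no conflict
]

conflicts = find_conflicting_labels(toy_rows, ["urlLength", "numDigits"], "target")
print(conflicts)
```

Exact duplicates with the same label are not flagged; only groups whose labels disagree count as conflicts.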

import pandas as pd

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks.integrity import ConflictingLabels
from deepchecks.tabular.datasets.classification.phishing import load_data

Load Data#

phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')
phishing_dataset = Dataset(
    phishing_dataframe,
    label='target',
    features=[
        'urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@',
        'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars',
    ],
)

Run the Check#

ConflictingLabels().run(phishing_dataset)

Conflicting Labels

Find samples which have the exact same features' values but different labels.

Additional Outputs
Each row in the table shows an example of a data sample and the its observed labels as found in the dataset. Showing top 5 of 17
Observed Labels  urlLength  numDigits  numParams  num_%20  num_@  bodyLength  numTitles  numImages  numLinks  specialChars
(0, 1)                  81          6          0        0      0           0          0          0         0             0
(0, 1)                  82          2          0        0      0           0          0          0         0             0
(0, 1)                  85          0          0        0      0           0          0          0         0             0
(0, 1)                  85         20          0        0      0           0          0          0         0             0
(0, 1)                  88          0          0        0      0           0          0          0         0             0


We can also limit how many conflicting samples are displayed with n_to_show, or check label ambiguity on a subset of the features with columns:

ConflictingLabels(n_to_show=1).run(phishing_dataset)

Conflicting Labels

Find samples which have the exact same features' values but different labels.

Additional Outputs
Each row in the table shows an example of a data sample and the its observed labels as found in the dataset. Showing top 1 of 17
Observed Labels  urlLength  numDigits  numParams  num_%20  num_@  bodyLength  numTitles  numImages  numLinks  specialChars
(0, 1)                  81          6          0        0      0           0          0          0         0             0


ConflictingLabels(columns=['urlLength', 'numDigits']).run(phishing_dataset)

Conflicting Labels

Find samples which have the exact same features' values but different labels.

Additional Outputs
Each row in the table shows an example of a data sample and the its observed labels as found in the dataset. Showing top 5 of 78
Observed Labels  urlLength  numDigits
(0, 1)                  81          0
(0, 1)                  81          6
(0, 1)                  82          2
(0, 1)                  84          2
(0, 1)                  85          0
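Restricting the check to two columns raised the count from 17 to 78 in the outputs above. A plausible reading is that samples which differed only in the dropped columns now look identical, so their labels can collide. The sketch below illustrates that effect on hypothetical toy rows; `count_conflicting_groups` is an illustrative helper, not a deepchecks function.

```python
from collections import defaultdict


def count_conflicting_groups(rows, columns, label="target"):
    """Count groups of rows that agree on `columns` but carry different labels."""
    labels_by_key = defaultdict(set)
    for row in rows:
        labels_by_key[tuple(row[c] for c in columns)].add(row[label])
    return sum(1 for labels in labels_by_key.values() if len(labels) > 1)


# Hypothetical toy data: the last two rows differ only in bodyLength.
toy_rows = [
    {"urlLength": 81, "numDigits": 6, "bodyLength": 0, "target": 0},
    {"urlLength": 81, "numDigits": 6, "bodyLength": 0, "target": 1},
    {"urlLength": 84, "numDigits": 2, "bodyLength": 0, "target": 0},
    {"urlLength": 84, "numDigits": 2, "bodyLength": 512, "target": 1},
]

all_columns = ["urlLength", "numDigits", "bodyLength"]
subset = ["urlLength", "numDigits"]

n_all = count_conflicting_groups(toy_rows, all_columns)
n_subset = count_conflicting_groups(toy_rows, subset)
print(n_all, n_subset)
```

With all three columns only the first pair conflicts; once bodyLength is dropped, the second pair collides as well.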


Define a Condition#

Now we define a condition that enforces that the ratio of samples with conflicting labels is 0. A condition is deepchecks’ way of validating model and data quality; it lets you know if anything goes wrong.

check = ConflictingLabels()
check.add_condition_ratio_of_conflicting_labels_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)
Conflicting Labels
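To make the condition concrete, here is a minimal sketch of the kind of comparison it performs. It assumes the ratio is the fraction of rows that belong to a group with more than one label; the library's exact definition may differ. The helper `conflicting_ratio`, the threshold name `max_ratio`, and the toy rows are hypothetical.

```python
from collections import defaultdict


def conflicting_ratio(rows, columns, label="target"):
    """Fraction of rows whose feature values also appear with a different label."""
    if not rows:
        return 0.0
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in columns)].append(row[label])
    # Every row in a group with two or more distinct labels counts as conflicting.
    n_conflicting = sum(
        len(labels) for labels in groups.values() if len(set(labels)) > 1
    )
    return n_conflicting / len(rows)


# Hypothetical toy data for illustration only.
toy_rows = [
    {"urlLength": 81, "numDigits": 6, "target": 0},
    {"urlLength": 81, "numDigits": 6, "target": 1},
    {"urlLength": 90, "numDigits": 2, "target": 0},
    {"urlLength": 90, "numDigits": 2, "target": 0},
]

max_ratio = 0  # assumed threshold, mirroring the 0 passed to the condition above
ratio = conflicting_ratio(toy_rows, ["urlLength", "numDigits"])
passed = ratio <= max_ratio
print(ratio, passed)
```

A threshold of 0 is strict: a single conflicting pair is enough to fail. A small positive threshold may be more practical when some label noise is expected.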


