Conflicting Labels

This notebook gives an overview of how to use and understand the conflicting labels check.

Structure:

What are Conflicting Labels?
Load Data
Run the Check
Define a Condition

What are Conflicting Labels?

The check searches for identical samples with different labels. This can occur when the data is mislabeled, or when the collected data is missing features needed to separate the labels. Mislabeled data can confuse the model and lower its performance.
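To make the definition concrete, here is a minimal sketch in plain Python that groups samples by their feature values and reports the groups carrying more than one label. It is an illustration only, not the deepchecks implementation; the helper name `find_conflicting_groups` and the toy data are assumptions made for this example.

```python
from collections import defaultdict

# Toy data as (features, label) pairs. The values are made up for illustration.
samples = [
    ((85, 0), 0),
    ((85, 0), 1),   # same features as the row above, different label: a conflict
    ((91, 2), 1),
    ((91, 2), 1),   # duplicate with the same label: not a conflict
    ((102, 5), 0),
]


def find_conflicting_groups(samples):
    """Map each feature tuple seen with more than one label to its row indices."""
    labels_by_features = defaultdict(set)
    rows_by_features = defaultdict(list)
    for index, (features, label) in enumerate(samples):
        labels_by_features[features].add(label)
        rows_by_features[features].append(index)
    # Keep only the groups whose members disagree on the label.
    return {
        features: rows_by_features[features]
        for features, labels in labels_by_features.items()
        if len(labels) > 1
    }


conflicts = find_conflicting_groups(samples)
print(conflicts)  # {(85, 0): [0, 1]}
```

Note that exact duplicates with the same label are not flagged; only groups whose members disagree on the label count as conflicting.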

import pandas as pd

from deepchecks.tabular import Dataset

from deepchecks.tabular.checks import ConflictingLabels
from deepchecks.tabular.datasets.classification.phishing import load_data

Load Data

phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')
phishing_dataset = Dataset(
    phishing_dataframe,
    label='target',
    features=[
        'urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@',
        'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars',
    ],
)

Run the Check

ConflictingLabels().run(phishing_dataset)

Conflicting Labels

		urlLength
Observed Labels	Instances
(0, 1)	6649, 10249, 4865, 4355, 2763, 3109, 495...	(85, 0, 0, 0, 0, 0, 0, 0, 0, 0)
	3643, 3133, 3625, 10982, 10034, 9364, 24...	(88, 0, 0, 0, 0, 0, 0, 0, 0, 0)
	10383, 5665, 337, 7652, 9464, 10522, 219...	(102, 0, 0, 0, 0, 0, 0, 0, 0, 0)
	6070, 4029, 4657, 2076, 8568	(94, 0, 0, 0, 0, 0, 0, 0, 0, 0)
	2507, 5074, 4530, 6619, 10738	(109, 1, 0, 0, 0, 0, 0, 0, 0, 0)

We can also limit how many conflicting groups are displayed with n_to_show, or check label ambiguity on a subset of the features with columns:

ConflictingLabels(n_to_show=1).run(phishing_dataset)

Conflicting Labels

		urlLength
Observed Labels	Instances
(0, 1)	6649, 10249, 4865, 4355, 2763, 3109, 495...	(85, 0, 0, 0, 0, 0, 0, 0, 0, 0)

ConflictingLabels(columns=['urlLength', 'numDigits']).run(phishing_dataset)

Conflicting Labels

		urlLength
Observed Labels	Instances
(0, 1)	6649, 2586, 1683, 7367, 11027, 1662, 785...	(85, 0)
	101, 11310, 5081, 4353, 1160, 8218, 6885...	(91, 0)
	9786, 10022, 8701, 5955, 3643, 9846, 119...	(88, 0)
	2638, 318, 408, 7100, 1803, 10553, 2818,...	(98, 0)
	5218, 2625, 70, 8514, 6913, 6070, 5792, ...	(94, 0)
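Restricting the check to fewer columns can only merge groups of samples, so it tends to surface more conflicts. The sketch below illustrates this effect in plain Python under stated assumptions: the rows are invented (only the column names are borrowed from the phishing dataset), and `count_conflicting_samples` is a hypothetical helper, not part of deepchecks.

```python
from collections import defaultdict

# Hypothetical rows: column names come from the phishing dataset, values are made up.
rows = [
    {"urlLength": 85, "numDigits": 0, "numLinks": 3, "target": 0},
    {"urlLength": 85, "numDigits": 0, "numLinks": 7, "target": 1},
    {"urlLength": 91, "numDigits": 0, "numLinks": 2, "target": 1},
    {"urlLength": 91, "numDigits": 0, "numLinks": 2, "target": 0},
]


def count_conflicting_samples(rows, columns, label="target"):
    """Count rows that share their values on `columns` with a differently labeled row."""
    labels_by_key = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in columns)
        labels_by_key[key].append(row[label])
    # A group is conflicting when it holds more than one distinct label.
    return sum(
        len(labels) for labels in labels_by_key.values() if len(set(labels)) > 1
    )


all_features = count_conflicting_samples(rows, ["urlLength", "numDigits", "numLinks"])
two_features = count_conflicting_samples(rows, ["urlLength", "numDigits"])
print(all_features, two_features)  # 2 4
```

With all three columns, only the last two rows collide. Dropping numLinks makes the first two rows identical as well, so every row ends up in a conflicting group.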

Define a Condition

Now we define a condition that requires the ratio of samples with conflicting labels to be 0. A condition is deepchecks' way of validating model and data quality and letting you know if anything goes wrong.

check = ConflictingLabels()
check.add_condition_ratio_of_conflicting_labels_less_or_equal(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

Conflicting Labels

Conditions Summary

Status	Condition	More Info
✖	Ambiguous sample ratio is less or equal to 0%	Ratio of samples with conflicting labels: 0.6%
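The logic behind such a condition can be sketched as a ratio compared against a threshold. The example below is a plain-Python illustration, assuming the ratio counts every sample that belongs to a conflicting group and divides by the total number of samples. The names `conflicting_ratio` and `condition_passes` are hypothetical and the data is made up; this is not how deepchecks computes or reports the result internally.

```python
from collections import defaultdict


def conflicting_ratio(samples):
    """Fraction of samples whose feature tuple appears with more than one label."""
    if not samples:
        return 0.0
    labels_by_features = defaultdict(list)
    for features, label in samples:
        labels_by_features[features].append(label)
    conflicting = sum(
        len(labels)
        for labels in labels_by_features.values()
        if len(set(labels)) > 1
    )
    return conflicting / len(samples)


def condition_passes(ratio, max_ratio=0.0):
    """Mimic a 'less or equal' condition: pass only if ratio <= max_ratio."""
    return ratio <= max_ratio


# Toy (features, label) pairs: the first two rows conflict, the rest do not.
samples = [
    ((85, 0), 0),
    ((85, 0), 1),
    ((91, 2), 1),
    ((102, 5), 0),
    ((109, 1), 1),
]

ratio = conflicting_ratio(samples)
print(f"{ratio:.0%}", condition_passes(ratio))       # 40% False
print(condition_passes(ratio, max_ratio=0.5))        # True
```

A threshold of 0 is strict: a single conflicting pair is enough to fail it, which matches the failed condition shown above. A looser threshold may be more practical for noisy data.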

