Conflicting Labels

This notebook provides an overview for using and understanding the Conflicting Labels check:

Structure:

- Why check for conflicting labels?
- Create TextData
- Run the Check
- With Text Normalization
- Define a Condition
Why check for conflicting labels?
The ConflictingLabels check finds identical or nearly identical samples (see text normalization below) in the dataset that have different labels. Conflicting labels can lead to inconsistencies and confusion for the model during training. Identifying such samples can help in cleaning the data and improving the model's performance.
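To build intuition for what "nearly identical" means, here is a minimal, self-contained sketch of the idea: normalize each text, then group samples whose normalized forms collide. This is an illustrative approximation only, not deepchecks' actual implementation; the normalize helper and the toy stopword list are hypothetical.

import re
from collections import defaultdict

STOPWORDS = {"a", "an", "is", "of", "the"}  # toy stopword list, for illustration only

def normalize(text):
    """Toy normalization: lowercase, strip punctuation, drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def find_conflicts(texts, labels):
    """Group samples by normalized text; report groups with more than one distinct label."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[normalize(text)].append(label)
    return {t: labs for t, labs in groups.items() if len(set(labs)) > 1}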
Create TextData
Let's create a simple dataset with some samples having conflicting labels.
from deepchecks.nlp import TextData
from deepchecks.nlp.checks import ConflictingLabels

texts = [
    "Deep learning is a subset of machine learning.",
    "Deep learning is a subset of machine learning.",
    "Deep learning is a sub-set of Machine Learning.",
    "Deep learning is subset of machine learning",
    "Natural language processing is a subfield of AI.",
    "This is a unique text sample.",
    "This is another unique text.",
]

# The first four samples are exact or near duplicates of one another, but
# they carry mixed labels (0, 1, 1, 0); these are the conflicts to catch.
labels = [0, 1, 1, 0, 2, 2, 2]

dataset = TextData(texts, label=labels, task_type='text_classification')
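Before running the check, it can help to eyeball the data. A quick sketch, assuming the TextData object exposes text and label accessors (adjust to your deepchecks version if needed):

# Print each sample next to its label to spot the conflicting group by eye.
for text, label in zip(dataset.text, dataset.label):
    print(label, text)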
Run the Check
# Run the check without any text normalization
ConflictingLabels(
    ignore_case=False,
    remove_punctuation=False,
    normalize_unicode=False,
    remove_stopwords=False,
    ignore_whitespace=False
).run(dataset)
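As a sanity check on why anything is flagged even without normalization: the first two samples are character-for-character identical yet carry different labels.

# Exact duplicates with different labels conflict even when every
# normalization step is disabled.
assert texts[0] == texts[1]
assert labels[0] != labels[1]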
With Text Normalization
By default, the ConflictingLabels check applies text normalization before identifying conflicting labels. This includes case normalization, punctuation removal, Unicode normalization, and stopword removal. You can also customize the normalization to fit your requirements:
ConflictingLabels(
    ignore_case=True,
    remove_punctuation=True,
    normalize_unicode=True,
    remove_stopwords=True,
    ignore_whitespace=True
).run(dataset)
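Using the toy normalize helper sketched earlier (an illustration, not deepchecks' real normalizer), you can see why all four "deep learning" variants now collapse into a single group with mixed labels:

# Case folding, punctuation removal and stopword removal reduce the four
# "deep learning" variants to a single normalized string, so their mixed
# labels ([0, 1, 1, 0]) are reported as a conflict.
print({normalize(t) for t in texts[:4]})  # a set with one normalized form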
Of all the parameters in this example, ignore_whitespace is the only one set to False by default.
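In other words, the default configuration is equivalent to spelling the parameters out like this:

# Equivalent to ConflictingLabels(): every normalization step is enabled
# by default except whitespace handling.
default_check = ConflictingLabels(
    ignore_case=True,
    remove_punctuation=True,
    normalize_unicode=True,
    remove_stopwords=True,
    ignore_whitespace=False
)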
Define a Condition
Now we define a condition that enforces the ratio of samples with conflicting labels to be 0. A condition is deepchecks' way to validate model and data quality, and it lets you know if anything goes wrong.
check = ConflictingLabels()
check.add_condition_ratio_of_conflicting_labels_less_or_equal(0)
result = check.run(dataset)
result.show(show_additional_outputs=False)
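To act on the outcome programmatically rather than visually, the result can report whether its conditions passed. A minimal sketch, assuming the standard deepchecks CheckResult.passed_conditions() helper:

# Fail fast in a data pipeline if any conflicting labels remain.
if not result.passed_conditions():
    raise ValueError("Dataset contains samples with conflicting labels")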