.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/integrity/plot_conflicting_labels.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_integrity_plot_conflicting_labels.py:

Conflicting Labels
******************

This notebook provides an overview of how to use and interpret the conflicting labels check.

**Structure:**

* `What are Conflicting Labels? <#what-are-conflicting-labels>`__
* `Load Data <#load-data>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

What are Conflicting Labels?
============================
The check searches for identical samples with different labels. This can happen
either because the data is mislabeled, or because the collected data is missing
features that are needed to separate the labels. Mislabeled data can confuse the
model and lower its performance.

.. GENERATED FROM PYTHON SOURCE LINES 22-25

.. code-block:: default

    import pandas as pd

    from deepchecks.tabular import Dataset

.. GENERATED FROM PYTHON SOURCE LINES 26-29

.. code-block:: default

    from deepchecks.tabular.checks.integrity import ConflictingLabels
    from deepchecks.tabular.datasets.classification.phishing import load_data

.. GENERATED FROM PYTHON SOURCE LINES 30-32

Load Data
=========

.. GENERATED FROM PYTHON SOURCE LINES 32-37

.. code-block:: default

    # Load the phishing data as a single DataFrame (no train/test split).
    phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')

    # Wrap it in a Dataset, naming the label column and the features to compare.
    phishing_dataset = Dataset(
        phishing_dataframe,
        label='target',
        features=['urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@',
                  'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars'],
    )

.. GENERATED FROM PYTHON SOURCE LINES 38-40

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 40-43

.. code-block:: default

    ConflictingLabels().run(phishing_dataset)

**Conflicting Labels**

Find samples which have the exact same features' values but different labels.

**Additional Outputs**

Each row in the table shows an example of a data sample and its observed labels as found in the dataset. Showing top 5 of 17.

.. csv-table::
    :header: "Observed Labels", "``urlLength``", "``numDigits``", "``numParams``", "``num_%20``", "``num_@``", "``bodyLength``", "``numTitles``", "``numImages``", "``numLinks``", "``specialChars``"

    "(0, 1)", 81, 6, 0, 0, 0, 0, 0, 0, 0, 0
    "(0, 1)", 82, 2, 0, 0, 0, 0, 0, 0, 0, 0
    "(0, 1)", 85, 0, 0, 0, 0, 0, 0, 0, 0, 0
    "(0, 1)", 85, 20, 0, 0, 0, 0, 0, 0, 0, 0
    "(0, 1)", 88, 0, 0, 0, 0, 0, 0, 0, 0, 0
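
The idea of "identical samples with different labels" can be illustrated in a few
lines of plain Python. This is only a minimal sketch of the concept, not
deepchecks' implementation: the helper ``find_conflicting_labels`` and the toy
rows are hypothetical and are not taken from the phishing dataset.

.. code-block:: python

    from collections import defaultdict


    def find_conflicting_labels(rows, feature_names, label_name):
        """Map each feature-value tuple seen with more than one label to its labels."""
        labels_by_features = defaultdict(set)
        for row in rows:
            # Two samples count as identical when all listed features match.
            key = tuple(row[name] for name in feature_names)
            labels_by_features[key].add(row[label_name])
        # Keep only the groups in which at least two distinct labels were observed.
        return {
            key: tuple(sorted(labels))
            for key, labels in labels_by_features.items()
            if len(labels) > 1
        }


    # Hypothetical toy data: the first two rows share features but not labels.
    toy_rows = [
        {"urlLength": 81, "numDigits": 6, "target": 0},
        {"urlLength": 81, "numDigits": 6, "target": 1},
        {"urlLength": 90, "numDigits": 2, "target": 0},
        {"urlLength": 90, "numDigits": 2, "target": 0},
    ]

    conflicts = find_conflicting_labels(toy_rows, ["urlLength", "numDigits"], "target")
    print(conflicts)

Each key of the returned dictionary plays the role of one row in the table above,
and its value corresponds to the "Observed Labels" column.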

.. GENERATED FROM PYTHON SOURCE LINES 44-45

The check can also be configured, for example to limit how many rows are displayed (``n_to_show``) or to check label ambiguity on a subset of the features (``columns``):

.. GENERATED FROM PYTHON SOURCE LINES 45-48

.. code-block:: default

    ConflictingLabels(n_to_show=1).run(phishing_dataset)

**Conflicting Labels**

Find samples which have the exact same features' values but different labels.

**Additional Outputs**

Each row in the table shows an example of a data sample and its observed labels as found in the dataset. Showing top 1 of 17.

.. csv-table::
    :header: "Observed Labels", "``urlLength``", "``numDigits``", "``numParams``", "``num_%20``", "``num_@``", "``bodyLength``", "``numTitles``", "``numImages``", "``numLinks``", "``specialChars``"

    "(0, 1)", 81, 6, 0, 0, 0, 0, 0, 0, 0, 0

.. GENERATED FROM PYTHON SOURCE LINES 49-52

.. code-block:: default

    ConflictingLabels(columns=['urlLength', 'numDigits']).run(phishing_dataset)

**Conflicting Labels**

Find samples which have the exact same features' values but different labels.

**Additional Outputs**

Each row in the table shows an example of a data sample and its observed labels as found in the dataset. Showing top 5 of 78.

.. csv-table::
    :header: "Observed Labels", "``urlLength``", "``numDigits``"

    "(0, 1)", 81, 0
    "(0, 1)", 81, 6
    "(0, 1)", 82, 2
    "(0, 1)", 84, 2
    "(0, 1)", 85, 0
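
Restricting the comparison to two columns reports 78 conflicting groups instead
of 17. A plausible reading is that samples which differ only in the ignored
features collapse into the same group once those features are dropped. The
sketch below shows this effect on hypothetical toy rows; the helper
``count_conflicting_groups`` is illustrative and not part of deepchecks.

.. code-block:: python

    from collections import defaultdict


    def count_conflicting_groups(rows, columns, label_name="target"):
        """Count groups of rows that agree on `columns` but carry several labels."""
        labels_by_key = defaultdict(set)
        for row in rows:
            labels_by_key[tuple(row[c] for c in columns)].add(row[label_name])
        return sum(1 for labels in labels_by_key.values() if len(labels) > 1)


    # Hypothetical toy data: the first two rows differ only in numLinks.
    toy_rows = [
        {"urlLength": 81, "numDigits": 0, "numLinks": 0, "target": 0},
        {"urlLength": 81, "numDigits": 0, "numLinks": 4, "target": 1},
        {"urlLength": 84, "numDigits": 2, "numLinks": 1, "target": 0},
        {"urlLength": 84, "numDigits": 2, "numLinks": 1, "target": 1},
    ]

    # With all three features, only the last two rows conflict.
    all_features = count_conflicting_groups(
        toy_rows, ["urlLength", "numDigits", "numLinks"]
    )
    # Dropping numLinks merges the first two rows into a second conflicting group.
    two_features = count_conflicting_groups(toy_rows, ["urlLength", "numDigits"])
    print(all_features, two_features)

Dropping a feature can only merge groups, never split them, so the number of
rows involved in a conflict cannot decrease when fewer columns are compared.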

.. GENERATED FROM PYTHON SOURCE LINES 53-58

Define a Condition
==================
Now we define a condition requiring the ratio of samples with conflicting labels
to be 0, so that any conflict makes the condition fail. A condition is deepchecks'
way to validate model and data quality, and it lets you know if anything goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 58-63

.. code-block:: default

    check = ConflictingLabels()
    # The condition fails if the ratio of conflicting samples is greater than 0.
    check.add_condition_ratio_of_conflicting_labels_not_greater_than(0)
    result = check.run(phishing_dataset)
    # Show only the summary, without the table of conflicting samples.
    result.show(show_additional_outputs=False)

Conflicting Labels
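
To make the condition concrete, the sketch below computes a conflicting-label
ratio on hypothetical toy rows and compares it with a threshold. It assumes the
ratio is the fraction of rows that belong to a group with more than one observed
label; that definition, and the helpers ``conflicting_ratio`` and
``condition_passes``, are assumptions for illustration and may differ from what
deepchecks computes internally.

.. code-block:: python

    from collections import Counter, defaultdict


    def conflicting_ratio(rows, feature_names, label_name):
        """Fraction of rows in groups with more than one label (assumed definition)."""
        labels_by_key = defaultdict(set)
        rows_per_key = Counter()
        for row in rows:
            key = tuple(row[name] for name in feature_names)
            labels_by_key[key].add(row[label_name])
            rows_per_key[key] += 1
        conflicting_rows = sum(
            count
            for key, count in rows_per_key.items()
            if len(labels_by_key[key]) > 1
        )
        return conflicting_rows / len(rows) if rows else 0.0


    def condition_passes(ratio, max_ratio=0.0):
        """Mirror a 'not greater than' condition: pass when ratio <= max_ratio."""
        return ratio <= max_ratio


    # Hypothetical toy data: two of the four rows conflict.
    toy_rows = [
        {"urlLength": 81, "numDigits": 6, "target": 0},
        {"urlLength": 81, "numDigits": 6, "target": 1},
        {"urlLength": 90, "numDigits": 2, "target": 0},
        {"urlLength": 95, "numDigits": 1, "target": 1},
    ]

    ratio = conflicting_ratio(toy_rows, ["urlLength", "numDigits"], "target")
    print(ratio, condition_passes(ratio, max_ratio=0.0))

A threshold of 0 is strict: a single pair of conflicting rows is enough to fail
it, which is why a small positive threshold may be more practical on noisy data.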

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 3.578 seconds)