.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/data_integrity/plot_under_annotated_property_segments.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_data_integrity_plot_under_annotated_property_segments.py:

.. _nlp__under_annotated_property_segments:

Under Annotated Property Segments
*********************************

This notebook provides an overview of how to use and understand the under annotated property segments check.

**Structure:**

* `What is the purpose of the check? <#what-is-the-purpose-of-the-check>`__
* `Automatically detecting under annotated segments <#automatically-detecting-under-annotated-segments>`__
* `Generate data & model <#generate-data-model>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

What is the purpose of the check?
==================================

The Under-Annotated Property Segments check is designed to help you easily identify segments in your data
that are under-annotated compared to the rest of your dataset, based on the provided :ref:`properties `.
The check can be very useful for identifying specific data samples (for example, less fluent or less
formal samples) for which there was a problem in the annotation process. The check can be guided to run
only on a specific list of properties, enabling you to focus on properties where you know an issue exists,
or on important business segments.

Automatically detecting under annotated segments
================================================

The check contains two main steps:

#. We train multiple simple tree-based models, each using exactly two properties (out of the ones
   selected above) to predict whether a sample has a label.

#. For each leaf in each tree (a data segment), we extract the corresponding data samples and calculate
   the annotation ratio within them. We keep the segments with the lowest annotation ratio.

.. GENERATED FROM PYTHON SOURCE LINES 41-43

Generate data & model
=====================

.. GENERATED FROM PYTHON SOURCE LINES 43-49

.. code-block:: default

    from deepchecks.nlp.utils.test_utils import load_modified_tweet_text_data

    text_data = load_modified_tweet_text_data()
    # Preview the first three rows of the text properties
    text_data.properties.head(3)


.. csv-table::
   :header: "", "Text Length", "Average Word Length", "Max Word Length", "% Special Characters", "Language", "Sentiment", "Subjectivity", "Toxicity", "Fluency", "Formality"

   0, 104, 5.058824, 11, 0.057692, en, -0.155556, 0.288889, 0.001683, 0.896180, 0.387794
   1, 98, 6.071429, 16, 0.061224, en, -0.250000, 0.750000, 0.020605, 0.862289, 0.224011
   2, 94, 4.277778, 8, 0.021277, en, 0.000000, 0.750000, 0.009497, 0.349153, 0.204132
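
The two steps listed under "Automatically detecting under annotated segments" can be illustrated with a minimal, dependency-free sketch. It is not the check's implementation: instead of training tree models, it splits each property at its median, so every pair of properties yields up to four candidate segments. The toy rows and the names ``annotation_ratio`` and ``lowest_ratio_segments`` are hypothetical.

```python
from itertools import combinations
from statistics import median

# Hypothetical toy data: a label of None means the sample is not annotated.
samples = [
    {"Text Length": 40, "Formality": 0.1, "label": "pos"},
    {"Text Length": 50, "Formality": 0.6, "label": "neg"},
    {"Text Length": 60, "Formality": 0.2, "label": "pos"},
    {"Text Length": 70, "Formality": 0.7, "label": "neg"},
    {"Text Length": 90, "Formality": 0.3, "label": "pos"},
    {"Text Length": 100, "Formality": 0.8, "label": None},
    {"Text Length": 110, "Formality": 0.9, "label": None},
    {"Text Length": 120, "Formality": 0.4, "label": "neg"},
]


def annotation_ratio(rows):
    """Fraction of rows that carry a label."""
    return sum(r["label"] is not None for r in rows) / len(rows)


def lowest_ratio_segments(rows, properties, top_k=1):
    """Step 1 stand-in: median splits per property pair. Step 2: rank by ratio."""
    segments = []
    for p1, p2 in combinations(properties, 2):
        m1 = median(r[p1] for r in rows)
        m2 = median(r[p2] for r in rows)
        # The four quadrants around the medians play the role of tree leaves.
        for above1 in (False, True):
            for above2 in (False, True):
                members = [
                    r for r in rows
                    if (r[p1] > m1) == above1 and (r[p2] > m2) == above2
                ]
                if members:
                    segments.append({
                        "properties": (p1, p2),
                        "above_median": (above1, above2),
                        "ratio": annotation_ratio(members),
                        "size": len(members),
                    })
    # Keep the segments with the lowest annotation ratio.
    return sorted(segments, key=lambda s: s["ratio"])[:top_k]


worst = lowest_ratio_segments(samples, ["Text Length", "Formality"])[0]
```

In this toy data the unannotated samples are both long and formal, so the segment above both medians is the one returned.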


.. GENERATED FROM PYTHON SOURCE LINES 50-71

Run the check
=============

The check has several key parameters (all optional) that affect its behavior and especially its output.

``properties / ignore_properties``: Controls which properties should be searched for under annotated segments.
By default, all properties are used.

``segment_minimum_size_ratio``: Determines the minimum size of segments that are of interest. The check will
return data segments that contain at least this fraction of the total data samples. It is recommended to
try different configurations of this parameter, as larger segments can be of interest even if their
annotation ratio is higher.

``categorical_aggregation_threshold``: By default the check will combine rare categories into a single category
called "Other". This parameter determines the frequency threshold for categories to be mapped into the
"Other" category.

``multiple_segments_per_column``: If True, will allow the same property to be a segmenting feature in multiple
segments, otherwise each property can appear in one segment at most. False by default.

See :class:`API reference ` for more details.

.. GENERATED FROM PYTHON SOURCE LINES 71-78

.. code-block:: default

    from deepchecks.nlp.checks import UnderAnnotatedPropertySegments

    # Run the check with its default parameters and display the result
    check = UnderAnnotatedPropertySegments()
    result = check.run(text_data)
    result.show()

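
To make the roles of ``categorical_aggregation_threshold`` and ``segment_minimum_size_ratio`` more concrete, here is a small standalone sketch of the two ideas in plain Python. The helper names ``aggregate_rare_categories`` and ``keep_large_segments`` are hypothetical, and treating a category as rare when its frequency is strictly below the threshold is an assumption, not a documented detail of the check.

```python
from collections import Counter


def aggregate_rare_categories(values, threshold):
    """Map categories whose relative frequency is below `threshold` to "Other".

    Assumption: "rare" means frequency strictly below the threshold.
    """
    counts = Counter(values)
    total = len(values)
    return [v if counts[v] / total >= threshold else "Other" for v in values]


def keep_large_segments(segment_sizes, n_samples, min_ratio):
    """Keep segments that hold at least `min_ratio` of all samples."""
    return {
        name: size
        for name, size in segment_sizes.items()
        if size / n_samples >= min_ratio
    }


# Hypothetical language column: 18 English samples and two rare languages.
languages = ["en"] * 18 + ["fr", "de"]
aggregated = Counter(aggregate_rare_categories(languages, threshold=0.1))

# Hypothetical segment sizes out of 1000 samples.
kept = keep_large_segments({"a": 87, "b": 30}, n_samples=1000, min_ratio=0.05)
```

Raising the minimum size ratio removes small segments from consideration, which is why trying several values can surface different segments.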


.. GENERATED FROM PYTHON SOURCE LINES 79-87

Observe the check's output
--------------------------

We see in the results that the check indeed found several under annotated segments.
In the scatter plot display we can see the under annotated segment as well as the annotation distribution
with respect to the two properties that are relevant to the segment.

To get the full list of under annotated segments found, we inspect the
``result.value`` attribute. Shown below are up to 3 segments with the lowest annotation ratio.

.. GENERATED FROM PYTHON SOURCE LINES 87-91

.. code-block:: default

    result.value['weak_segments_list'].head(3)


.. csv-table::
   :header: "", "Annotation Ratio", "Feature1", "Feature1 Range", "Feature2", "Feature2 Range", "% of Data", "Samples in Segment"

   0, 0.469136, "Text Length", "(80.5, 129.5)", "Formality", "(0.40074723958969116, inf)", 8.7, "[6, 15, 17, 18, 20, 27, 28, 39, 85, 91, 95, 97..."
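
A segment row can be read as a pair of ranges, one per property. The sketch below shows how such ranges select samples and how an annotation ratio follows from them. The bounds are copied from the row above; the sample rows, the labels, the helper name ``in_segment`` and the choice of exclusive bounds are all assumptions made for illustration.

```python
import math

# Bounds taken from the segment row above.
segment = {
    "Text Length": (80.5, 129.5),
    "Formality": (0.40074723958969116, math.inf),
}

# Hypothetical samples: (properties, label); a label of None means unannotated.
rows = [
    ({"Text Length": 100, "Formality": 0.55}, "positive"),
    ({"Text Length": 120, "Formality": 0.90}, None),
    ({"Text Length": 90, "Formality": 0.45}, None),
    ({"Text Length": 104, "Formality": 0.39}, "negative"),  # Formality too low
    ({"Text Length": 60, "Formality": 0.80}, "positive"),   # Text Length too low
]


def in_segment(properties, segment):
    """True when every property value lies inside its range (bounds assumed exclusive)."""
    return all(low < properties[name] < high for name, (low, high) in segment.items())


members = [(props, label) for props, label in rows if in_segment(props, segment)]
annotated = sum(label is not None for _, label in members)
ratio = annotated / len(members)
```

Here three of the five hypothetical samples fall inside the segment and only one of them is labeled, giving a ratio of one third.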


.. GENERATED FROM PYTHON SOURCE LINES 92-100

Define a condition
==================

We can add a condition that validates that the annotation ratio in every data segment is
above a certain threshold.
A scenario where this can be useful is when we want to make sure that we have enough annotations for
evaluating model quality or drift on a subset of the data that is of interest to us,
for example for specific age or gender groups.

.. GENERATED FROM PYTHON SOURCE LINES 100-107

.. code-block:: default

    # Let's add a condition and re-run the check:
    check = UnderAnnotatedPropertySegments()
    check.add_condition_segments_annotation_ratio_greater_than(0.7)
    result = check.run(text_data)
    result.show(show_additional_outputs=False)

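
The logic behind such a condition can be sketched in a few lines of plain Python. This only mirrors the description above (every segment must have an annotation ratio greater than the threshold); how the library evaluates and reports the condition internally is not shown here. The function name ``segments_ratio_condition`` and the second segment are hypothetical, while the ratio 0.469136 and the threshold 0.7 come from the example.

```python
def segments_ratio_condition(segment_ratios, threshold):
    """Return (passed, failing): failing holds segments whose ratio is not above threshold."""
    failing = {
        name: ratio
        for name, ratio in segment_ratios.items()
        if ratio <= threshold
    }
    return (not failing, failing)


segment_ratios = {
    "Text Length x Formality": 0.469136,  # ratio reported in the example output
    "hypothetical segment": 0.82,         # made-up value for illustration
}

passed, failing = segments_ratio_condition(segment_ratios, threshold=0.7)
```

With a threshold of 0.7 the reported segment fails the condition, which matches the intent of flagging segments that lack enough annotations.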


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 9.269 seconds)


.. _sphx_glr_download_nlp_auto_checks_data_integrity_plot_under_annotated_property_segments.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_under_annotated_property_segments.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_under_annotated_property_segments.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_