Under Annotated Metadata Segments#

This notebook provides an overview of how to use and understand the Under Annotated Metadata Segments check.

Structure:

What is the purpose of the check?
Automatically detecting under annotated segments
Generate data & model
Run the check
Define a condition

What is the purpose of the check?#

The Under Annotated Metadata Segments check helps you identify segments of your data that are under-annotated compared to the rest of the dataset, based on the provided metadata. For example, it can reveal a specific data source for which less labeled data was collected. The check can be restricted to a specific list of metadata columns, so you can focus on columns where you know a problem exists or on important business segments.

Automatically detecting under annotated segments#

The check contains two main steps:

We train multiple simple tree-based models, each trained on exactly two metadata columns (out of the selected columns) to predict whether a sample has a label.
We extract the data samples that fall in each leaf of each tree (the data segments) and calculate the annotation ratio of the samples within it. We keep the segments with the lowest annotation ratio.
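The two steps above can be illustrated with a minimal sketch. It is not the check's implementation: instead of training tree models, it splits numeric columns at their median and uses categorical values as-is, which is a deliberate simplification. The function name, the `min_size_ratio` and `top_k` parameters, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import median


def find_under_annotated_segments(metadata, labels, min_size_ratio=0.05, top_k=3):
    """metadata: list of dicts, one per sample. labels: list where None means 'no label'."""
    n = len(metadata)
    columns = list(metadata[0])

    # Assumption: numeric columns are split at their median. This fixed binning
    # stands in for the splits a trained tree would learn.
    medians = {
        c: median(row[c] for row in metadata)
        for c in columns
        if isinstance(metadata[0][c], (int, float))
    }

    def bucket(column, value):
        if column in medians:
            return "<= median" if value <= medians[column] else "> median"
        return value

    segments = []
    # One grouping per pair of metadata columns, mirroring "exactly two columns per model".
    for col_a, col_b in combinations(columns, 2):
        groups = defaultdict(list)
        for idx, row in enumerate(metadata):
            key = (bucket(col_a, row[col_a]), bucket(col_b, row[col_b]))
            groups[key].append(idx)
        for (val_a, val_b), indices in groups.items():
            if len(indices) / n < min_size_ratio:
                continue  # skip segments that are too small to be of interest
            annotated = sum(labels[i] is not None for i in indices)
            segments.append({
                "features": {col_a: val_a, col_b: val_b},
                "annotation_ratio": annotated / len(indices),
                "pct_of_data": 100 * len(indices) / n,
                "samples": indices,
            })
    # Keep the segments with the lowest annotation ratio.
    return sorted(segments, key=lambda s: s["annotation_ratio"])[:top_k]


# Hypothetical toy data: older European users are unlabeled.
toy_metadata = [
    {"user_age": 25, "user_region": "Europe"},
    {"user_age": 45, "user_region": "Europe"},
    {"user_age": 50, "user_region": "Europe"},
    {"user_age": 30, "user_region": "Americas"},
    {"user_age": 28, "user_region": "Americas"},
    {"user_age": 60, "user_region": "Americas"},
]
toy_labels = ["pos", None, None, "neg", "pos", "neg"]

segments = find_under_annotated_segments(toy_metadata, toy_labels)
print(segments[0])
```

On this toy data, the first segment returned is the one for European users above the median age, because none of its samples has a label.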

Generate data & model#

from deepchecks.nlp.utils.test_utils import load_modified_tweet_text_data

text_data = load_modified_tweet_text_data()
text_data.metadata.head(3)

	user_age	gender	days_on_platform	user_region
0	30.73	Male	5614	Americas
1	42.29	Female	4308	Europe
2	24.97	Male	2729	Middle East/Africa

Run the check#

The check has several key parameters, all optional, that affect its behavior and especially its output.

columns / ignore_columns: Controls which columns should be searched for under annotated segments. By default, uses all columns.

segment_minimum_size_ratio: Determines the minimum size of segments that are of interest. The check will return data segments that contain at least this fraction of the total data samples. It is recommended to try different configurations of this parameter, as larger segments can be of interest even if their annotation ratio is higher.

categorical_aggregation_threshold: By default, the check combines rare categories into a single category called “Other”. This parameter determines the frequency threshold below which categories are mapped into the “Other” category.

multiple_segments_per_column: If True, will allow the same metadata column to be a segmenting feature in multiple segments, otherwise each metadata column can appear in one segment at most. True by default.

See the API reference for more details.
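To make the effect of categorical_aggregation_threshold more concrete, here is a minimal sketch of rare-category aggregation. It is an illustration only: the helper name and the toy data are hypothetical, and the strict "below the threshold" comparison is an assumption rather than the check's documented behavior.

```python
from collections import Counter


def aggregate_rare_categories(values, threshold, other_label="Other"):
    """Replace categories whose relative frequency is below `threshold` with `other_label`."""
    counts = Counter(values)
    total = len(values)
    # Assumption: a category is "rare" when its frequency is strictly below the threshold.
    rare = {category for category, count in counts.items() if count / total < threshold}
    return [other_label if value in rare else value for value in values]


# Hypothetical column: "Asia Pacific" covers 10% of the samples.
regions = ["Europe"] * 6 + ["Americas"] * 3 + ["Asia Pacific"]
aggregated = aggregate_rare_categories(regions, threshold=0.2)
print(Counter(aggregated))
```

With a threshold of 0.2, only "Asia Pacific" is mapped to "Other"; the two more frequent regions are kept as separate categories.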

from deepchecks.nlp.checks import UnderAnnotatedMetaDataSegments

check = UnderAnnotatedMetaDataSegments(segment_minimum_size_ratio=0.07,
                                       multiple_segments_per_column=True)
result = check.run(text_data)
result.show()

Under Annotated Meta Data Segments

Observe the check’s output#

We see in the results that the check indeed found several under annotated segments. The scatter plot shows the under annotated segment together with the annotation distribution over the two metadata columns that define it. To get the full list of under annotated segments found, inspect the result.value attribute. Shown below are the segments with the lowest annotation ratio (up to 3).

result.value['weak_segments_list'].head(3)

	Annotation Ratio	Feature1	Feature1 Range	Feature2	Feature2 Range	% of Data	Samples in Segment
0	0.487032	user_region	[Europe]	user_age	(39.989999771118164, inf)	7.46	[1, 5, 9, 44, 69, 88, 95, 115, 132, 172, 173, ...
1	0.819444	user_region	[Europe]		None	26.31	[1, 4, 5, 8, 9, 11, 14, 16, 17, 30, 44, 47, 51...

Define a condition#

We can add a condition that validates that the annotation ratio in every data segment is above a certain threshold. This is useful when we want to make sure we have enough annotations to reliably evaluate model quality or drift on a subset of the data that is of interest to us, for example specific age or gender groups.

# Let's add a condition and re-run the check:

check.add_condition_segments_annotation_ratio_greater_than(0.7)
result = check.run(text_data)
result.show(show_additional_outputs=False)
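The logic behind such a condition can be sketched in a few lines. This is a simplified stand-in, not the library's code: the function name is hypothetical, and the strict "greater than" comparison is inferred from the condition's name. The ratios are the ones reported for the two segments in the results table above.

```python
def annotation_ratio_condition(annotation_ratios, threshold):
    """Pass only if every segment's annotation ratio is greater than the threshold."""
    failing = [ratio for ratio in annotation_ratios if ratio <= threshold]
    return len(failing) == 0, failing


# Annotation ratios of the two segments listed in the results table.
passed, failing = annotation_ratio_condition([0.487032, 0.819444], threshold=0.7)
print(passed, failing)
```

With a threshold of 0.7, the condition does not pass in this sketch, because the segment with an annotation ratio of 0.487032 falls below it.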

