UnderAnnotatedPropertySegments

class UnderAnnotatedPropertySegments

Search for under-annotated data segments.

The check is designed to help you easily identify under-annotated segments of your data. The segments are based on text properties, which are features extracted from the text, such as “language” and “number of words”. For more on properties, see the NLP Properties Guide.

To achieve this, the check trains several simple tree-based models that try to predict, given a sample's properties, whether the sample has a label. The relevant segments are detected by analyzing the leaves of the trained trees.
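To make the idea concrete, here is a minimal sketch of the underlying principle, not the check's actual implementation. It replaces the tree models with a single threshold split on one numeric property and reports the side of the split with the lowest annotation ratio. The function name `find_under_annotated_split`, the sample data, and the property values are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_under_annotated_split(
    values: List[float],
    labels: List[Optional[str]],
    min_size_ratio: float = 0.05,
) -> Tuple[str, float]:
    """Return (segment description, annotation ratio) for the weakest split.

    A depth-1 stand-in for the tree models: every distinct value is tried as a
    threshold, and both sides of the split are scored by the share of samples
    that have a label.
    """
    n = len(values)
    min_size = min_size_ratio * n
    best: Optional[Tuple[str, float]] = None

    for threshold in sorted(set(values)):
        low = [lab for v, lab in zip(values, labels) if v <= threshold]
        high = [lab for v, lab in zip(values, labels) if v > threshold]
        for name, segment in ((f"<= {threshold}", low), (f"> {threshold}", high)):
            # Skip segments that are empty or too small to be meaningful.
            if not segment or len(segment) < min_size:
                continue
            ratio = sum(lab is not None for lab in segment) / len(segment)
            if best is None or ratio < best[1]:
                best = (name, ratio)

    if best is None:
        raise ValueError("no segment satisfies the minimum size")
    return best


# Hypothetical "number of words" property; long texts are mostly unlabeled.
word_counts = [5, 8, 12, 15, 40, 55, 60, 80]
labels = ["pos", "neg", "pos", "neg", None, None, "pos", None]

segment, ratio = find_under_annotated_split(word_counts, labels, min_size_ratio=0.25)
print(segment, ratio)
```

In this toy data the weakest segment is the one with more than 15 words, where only one in four samples is labeled. The real check searches over several properties at once and uses deeper trees, so its segments can combine more than one property.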

Parameters
properties : Union[Hashable, List[Hashable]], default: None

Properties to check. If none are given, all properties except the ignored ones are checked.

ignore_properties : Union[Hashable, List[Hashable]], default: None

Properties to ignore. If none are given, the check selects properties based on the properties parameter.

n_top_properties : Optional[int], default: 10

Number of properties to use for segment search. Top properties are selected based on feature importance.

segment_minimum_size_ratio : float, default: 0.05

Minimum size ratio for segments. Will only search for segments of size >= segment_minimum_size_ratio * data_size.

n_samples : int, default: 10_000

Maximum number of samples to use for this check.

n_to_show : int, default: 3

Number of segments with the weakest performance to show.

categorical_aggregation_threshold : float, default: 0.05

In each categorical column, categories with a frequency below the threshold are merged into an “Other” category.

multiple_segments_per_property : bool, default: False

If True, the same property may serve as a segmenting feature in multiple segments; otherwise, each property can appear in at most one segment.
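As a rough illustration of two of the parameters above, the following sketch shows how a frequency threshold can merge rare categories into “Other” and how the minimum segment size follows from the size ratio. The helper names `merge_rare_categories` and `minimum_segment_size` and the sample data are hypothetical; they mirror the parameter descriptions and are not the library's internal code.

```python
from collections import Counter
from typing import List


def merge_rare_categories(values: List[str], threshold: float = 0.05) -> List[str]:
    """Replace categories whose relative frequency is below `threshold` with "Other"."""
    counts = Counter(values)
    total = len(values)
    return [v if counts[v] / total >= threshold else "Other" for v in values]


def minimum_segment_size(data_size: int, segment_minimum_size_ratio: float = 0.05) -> float:
    """Smallest segment size the search would consider for a dataset of `data_size`."""
    return segment_minimum_size_ratio * data_size


# Hypothetical "language" property for 100 samples.
languages = ["en"] * 90 + ["es"] * 7 + ["de"] * 2 + ["fr"] * 1
merged = merge_rare_categories(languages, threshold=0.05)

print(Counter(merged))
print(minimum_segment_size(10_000))
```

Here “de” and “fr” each cover less than 5% of the samples, so both end up in “Other”, while “en” and “es” are kept as separate categories.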

__init__(properties: Optional[Union[Hashable, List[Hashable]]] = None, ignore_properties: Optional[Union[Hashable, List[Hashable]]] = None, n_top_properties: Optional[int] = 10, segment_minimum_size_ratio: float = 0.05, n_samples: int = 10000, categorical_aggregation_threshold: float = 0.05, n_to_show: int = 3, multiple_segments_per_property: bool = False, **kwargs)
__new__(*args, **kwargs)

Attributes

UnderAnnotatedPropertySegments.categorical_aggregation_threshold

UnderAnnotatedPropertySegments.min_category_size_ratio

UnderAnnotatedPropertySegments.n_to_show

UnderAnnotatedPropertySegments.n_top_features

UnderAnnotatedPropertySegments.random_state

UnderAnnotatedPropertySegments.segment_minimum_size_ratio

Methods

UnderAnnotatedPropertySegments.add_condition(...)

Add new condition function to the check.

UnderAnnotatedPropertySegments.add_condition_segments_annotation_ratio_greater_than([...])

Add condition - check that the annotation ratio in all segments is above the provided threshold.

UnderAnnotatedPropertySegments.add_condition_segments_relative_performance_greater_than([...])

Add condition - check that the score of the weakest segment is greater than the supplied relative threshold.

UnderAnnotatedPropertySegments.clean_conditions()

Remove all conditions from this check instance.

UnderAnnotatedPropertySegments.conditions_decision(result)

Run conditions on given result.

UnderAnnotatedPropertySegments.config([...])

Return check configuration (conditions' configuration not yet supported).

UnderAnnotatedPropertySegments.from_config(conf)

Return check object from a CheckConfig object.

UnderAnnotatedPropertySegments.from_json(conf)

Deserialize check instance from JSON string.

UnderAnnotatedPropertySegments.metadata([...])

Return check metadata.

UnderAnnotatedPropertySegments.name()

Name of class in split camel case.

UnderAnnotatedPropertySegments.params([...])

Return parameters to show when printing the check.

UnderAnnotatedPropertySegments.remove_condition(index)

Remove given condition by index.

UnderAnnotatedPropertySegments.run(dataset)

Run check.

UnderAnnotatedPropertySegments.run_logic(...)

Run check.

UnderAnnotatedPropertySegments.to_json([...])

Serialize check instance to JSON string.

Examples
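The following sketch illustrates, in plain Python, the kind of test the annotation-ratio condition describes: compute the share of labeled samples in each segment and require every segment to exceed a threshold. It does not call the library. The helpers `annotation_ratio` and `all_segments_above`, the segment descriptions, and the threshold values are hypothetical, and the real condition's default threshold and exact comparison may differ.

```python
from typing import Dict, List, Optional


def annotation_ratio(labels: List[Optional[str]]) -> float:
    """Share of samples in a segment that have a label."""
    return sum(label is not None for label in labels) / len(labels)


def all_segments_above(segments: Dict[str, List[Optional[str]]], threshold: float) -> bool:
    """Hypothetical stand-in for the condition: every segment must exceed `threshold`."""
    return all(annotation_ratio(labels) > threshold for labels in segments.values())


# Hypothetical segments, keyed by a human-readable description.
segments = {
    "language == 'en'": ["pos", "neg", "pos", None],
    "number of words > 50": ["pos", None, None, None],
}

print({name: annotation_ratio(labels) for name, labels in segments.items()})
print(all_segments_above(segments, threshold=0.5))
```

With these toy segments, the long-text segment has only one labeled sample in four, so a threshold of 0.5 is not met.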