MetadataSegmentsPerformance#

class MetadataSegmentsPerformance[source]#

Search for segments with low performance scores.

The check is designed to help you easily identify weak spots of your model and provide a deep-dive analysis of its performance on different segments of your data. Specifically, it is designed to help you identify the model's weakest segments in the data distribution, for further improvement and visibility purposes.

The segments are based on the metadata: data that is not part of the text but is related to it, such as "user_id" and "user_age".

In order to achieve this, the check trains several simple tree-based models that try to predict the error of the user-provided model on the dataset. The relevant segments are detected by analyzing the different leaves of the trained trees.
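The leaf-based segment search described above can be sketched roughly as follows. This is a minimal illustration of the idea, not deepchecks' actual implementation; the data and the `user_age` column are made up, and the planted high-loss region stands in for a genuine weak segment.

```python
# Sketch: fit a shallow decision tree on metadata columns to predict
# the per-sample loss, then treat high-loss leaves as weak segments.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical metadata: user_age, with a per-sample loss that is
# higher for young users (the planted "weak segment").
user_age = rng.integers(18, 70, size=1000)
loss = np.where(user_age < 30, 0.8, 0.2) + rng.normal(0, 0.05, 1000)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50, random_state=0)
tree.fit(user_age.reshape(-1, 1), loss)

# Each leaf defines a segment; summarize each leaf by its age range
# and mean loss, then pick the leaf with the highest mean loss.
leaf_ids = tree.apply(user_age.reshape(-1, 1))
segments = {
    leaf: (user_age[leaf_ids == leaf].min(),
           user_age[leaf_ids == leaf].max(),
           loss[leaf_ids == leaf].mean())
    for leaf in np.unique(leaf_ids)
}
weakest = max(segments.values(), key=lambda s: s[2])
print(weakest)  # the age range with the highest mean loss
```

Here the weakest segment recovered is the young-user age range, mirroring how the check surfaces metadata-defined segments where the model's loss is highest.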

Parameters
columns: Union[Hashable, List[Hashable]], default: None

Columns to check; if None, checks all columns except ignored ones.

ignore_columns: Union[Hashable, List[Hashable]], default: None

Columns to ignore; if None, the columns to check are determined by the columns parameter.

n_top_columns: int, default: 10

Number of features to use for segment search. Top columns are selected based on feature importance.

segment_minimum_size_ratio: float, default: 0.05

Minimum size ratio for segments. Will only search for segments of size >= segment_minimum_size_ratio * data_size.

alternative_scorer: Tuple[str, Union[str, Callable]], default: None

Scorer to use as the performance measure, either a function or a sklearn scorer name. If None, a default scorer (per the model type) will be used.

loss_per_sample: Union[np.ndarray, pd.Series, None], default: None

Loss per sample, used to detect the relevant weak segments. If a pd.Series, its index should match that of the provided dataset object; if an np.ndarray, its order should follow the index order of the dataset object. If None, the check calculates the loss per sample via log loss for classification and MSE for regression.

n_samples: int, default: 10_000

Maximum number of samples to use for this check.

n_to_show: int, default: 3

Number of segments with the weakest performance to show.

categorical_aggregation_threshold: float, default: 0.05

In each categorical column, categories with frequency below the threshold will be merged into an "Other" category.
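The default per-sample losses mentioned for loss_per_sample can be sketched as below. This is an assumption about the behavior described above (log loss for classification, squared error for regression), not deepchecks' exact code; the helper names are made up for illustration.

```python
# Sketch of the default per-sample losses: negative log-probability
# of the true class for classification, squared error for regression.
import numpy as np

def log_loss_per_sample(y_true, y_proba, eps=1e-15):
    """Per-sample log loss: -log of the probability of the true class."""
    y_proba = np.clip(y_proba, eps, 1 - eps)
    return -np.log(y_proba[np.arange(len(y_true)), y_true])

def squared_error_per_sample(y_true, y_pred):
    """Per-sample MSE contribution: squared residual of each prediction."""
    return (np.asarray(y_true) - np.asarray(y_pred)) ** 2

# Classification: probabilities per class, true labels as class indices.
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
print(log_loss_per_sample(np.array([0, 1, 0]), proba))

# Regression: squared residuals.
print(squared_error_per_sample([3.0, 1.0], [2.5, 1.5]))
```

A per-sample (rather than aggregated) loss is what lets the segment search distinguish which individual samples, and hence which metadata segments, the model struggles on.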

__init__(columns: Optional[Union[Hashable, List[Hashable]]] = None, ignore_columns: Optional[Union[Hashable, List[Hashable]]] = None, n_top_columns: int = 10, segment_minimum_size_ratio: float = 0.05, alternative_scorer: Optional[Dict[str, Callable]] = None, loss_per_sample: Optional[Union[ndarray, Series]] = None, n_samples: int = 10000, categorical_aggregation_threshold: float = 0.05, n_to_show: int = 3, **kwargs)[source]#
__new__(*args, **kwargs)#

Methods

MetadataSegmentsPerformance.add_condition(...)

Add new condition function to the check.

MetadataSegmentsPerformance.add_condition_segments_relative_performance_greater_than([...])

Add condition - check that the score of the weakest segment is greater than supplied relative threshold.

MetadataSegmentsPerformance.clean_conditions()

Remove all conditions from this check instance.

MetadataSegmentsPerformance.conditions_decision(result)

Run conditions on given result.

MetadataSegmentsPerformance.config([...])

Return check configuration (conditions' configuration not yet supported).

MetadataSegmentsPerformance.from_config(conf)

Return check object from a CheckConfig object.

MetadataSegmentsPerformance.from_json(conf)

Deserialize check instance from JSON string.

MetadataSegmentsPerformance.metadata([...])

Return check metadata.

MetadataSegmentsPerformance.name()

Name of class in split camel case.

MetadataSegmentsPerformance.params([...])

Return parameters to show when printing the check.

MetadataSegmentsPerformance.remove_condition(index)

Remove given condition by index.

MetadataSegmentsPerformance.run(dataset[, ...])

Run check.

MetadataSegmentsPerformance.run_logic(...)

Run check.

MetadataSegmentsPerformance.to_json([...])

Serialize check instance to JSON string.