Performance Bias

This notebook gives an overview of how to use and interpret the Performance Bias check.

Structure:

What is the purpose of the check?
Generate data & model
Run the check
Define a condition

What is the purpose of the check?

The check is designed to help you identify subgroups for which the model has a much lower performance score than its baseline score (its overall performance). The subgroups are defined by a chosen protected feature (e.g., “sex”, “race”) and you can specify a control feature (e.g., “education”) by which to group the data before computing performance differences. This is primarily useful for fairness analyses, but can also be used to identify other types of performance disparities.
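As a rough illustration of the idea (not the check's actual implementation), the sketch below groups a few made-up records by a control feature, treats each group's accuracy as the baseline, and reports how far each protected subgroup falls from that baseline. The records, the category values, and the helper `subgroup_diffs` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (protected value, control value, label, prediction).
records = [
    ("A", "HS-grad", 1, 1),
    ("A", "HS-grad", 0, 0),
    ("A", "HS-grad", 1, 1),
    ("B", "HS-grad", 1, 0),
    ("B", "HS-grad", 0, 0),
    ("A", "Masters", 1, 1),
    ("B", "Masters", 0, 0),
]


def accuracy(rows):
    """Fraction of rows whose prediction matches the label."""
    return sum(label == pred for _, _, label, pred in rows) / len(rows)


def subgroup_diffs(rows):
    """Map (protected, control) to subgroup accuracy minus control-group accuracy."""
    by_control = defaultdict(list)
    for row in rows:
        by_control[row[1]].append(row)

    diffs = {}
    for control, group in by_control.items():
        baseline = accuracy(group)  # baseline for this control category
        by_protected = defaultdict(list)
        for row in group:
            by_protected[row[0]].append(row)
        for protected, subgroup in by_protected.items():
            diffs[(protected, control)] = accuracy(subgroup) - baseline
    return diffs


diffs = subgroup_diffs(records)
worst = min(diffs, key=diffs.get)  # subgroup furthest below its baseline
print(worst, round(diffs[worst], 3))
```

In this toy data the subgroup furthest below its baseline is "B" within "HS-grad", which is the kind of subgroup the check is meant to surface.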

Large performance disparities can indicate a problem with the model. The training data may not be sufficient for certain subgroups or may contain biases, or the model may need to be re-calibrated when applied to certain subgroups. When using appropriate scoring functions, looking at performance disparities can help uncover issues of these kinds.

Remember that this check relies on labeled data provided in the dataset. As such, it can only assess performance disparities to the extent that the labeled data is accurate and representative of the population of interest. Using scoring functions that are robust to class imbalance or that are computed for each model class can help mitigate this issue.
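To see why the choice of scoring function matters, here is a minimal sketch comparing plain accuracy with a hand-written balanced accuracy (the mean of per-class recall) on a hypothetical, heavily imbalanced subgroup. The data and both helper functions are illustrative assumptions and are not taken from deepchecks.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in preds_for_cls) / len(preds_for_cls))
    return sum(recalls) / len(recalls)


# Hypothetical subgroup: 9 negatives, 1 positive; the model always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f} balanced_accuracy={bal:.2f}")
```

Plain accuracy looks high here even though the model never identifies the positive class, while the class-balanced score exposes the problem. A disparity measured with plain accuracy could therefore be hidden or exaggerated when class proportions differ across subgroups.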

Generate data & model

from deepchecks.tabular.datasets.classification.adult import (
    load_data, load_fitted_model)

train_dataset, test_dataset = load_data()
model = load_fitted_model()

Run the check

The check requires the argument protected_feature, which identifies the column that defines the subgroups for which performance disparities are assessed. The check also has several optional parameters that affect its behavior and output:

control_feature: Column to use to split the data by groups prior to computing performance disparities.
scorer: Scoring function used to measure performance. Defaults to “accuracy” for classification tasks and “r2” for regression tasks.
max_subgroups_per_control_cat_to_display: Maximum number of subgroups (per control_feature category) to display.
max_control_cat_to_display: Maximum number of control_feature categories to display.

See the API reference for more details.

from deepchecks.tabular.checks.model_evaluation import PerformanceBias

check = PerformanceBias(
   protected_feature="race",
   control_feature="education",
   scorer="accuracy",
   max_segments=3)
result = check.run(test_dataset, model)
result.show()


Observe the check’s output

The check reports, for each subgroup, how its score compares with the baseline score of its control category. In the first rows of the scores table below, the largest disparity is for the subgroup “Black” within the “Masters” category of the control feature “education”: the model's accuracy on this subgroup is 0.684, versus 0.771 for the entire education category.

result.value['scores_df'].head(3)

    race    education  _scorer   _score    _baseline  _baseline_count  _count  _diff
12  Black   Masters    accuracy  0.684211  0.770878   934              57      -0.086667
26  Others  Others     accuracy  0.848485  0.900067   1501             99      -0.051582
22  Others  Assoc-voc  accuracy  0.766667  0.807069   679              30      -0.040403
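The column names suggest that _diff is the subgroup score minus the baseline score of its control category. The short sketch below recomputes that difference from the three rows shown above. It only checks the arithmetic on the displayed values and assumes nothing about how the check computes them internally.

```python
# Rows copied from the scores table: (race, education, _score, _baseline, _diff).
rows = [
    ("Black", "Masters", 0.684211, 0.770878, -0.086667),
    ("Others", "Others", 0.848485, 0.900067, -0.051582),
    ("Others", "Assoc-voc", 0.766667, 0.807069, -0.040403),
]

# Recompute the difference as subgroup score minus control-category baseline.
recomputed = [score - baseline for _, _, score, baseline, _ in rows]

for (race, education, _, _, reported), diff in zip(rows, recomputed):
    print(f"{race:>6} / {education:<9} recomputed={diff:+.6f} reported={reported:+.6f}")
```

The recomputed values agree with the reported _diff column up to rounding in the last displayed digit, so a negative _diff means the subgroup scores below its baseline.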

Define a condition

We can add a condition to the check that validates that all performance disparities fall within given bounds. If the condition is not met, the check fails.

Let’s add a condition and re-run the check:

check.add_condition_bounded_performance_difference(lower_bound=-0.1)
result = check.run(test_dataset, model)
result.show(show_additional_outputs=False)
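Conceptually, a bounded-difference condition can be pictured as a range test over the performance differences. The sketch below assumes the condition passes when every difference is at least lower_bound (and at most an optional upper bound); the helper `out_of_bounds` is hypothetical and is not deepchecks code. It is applied to the three differences listed in the scores table earlier.

```python
import math


def out_of_bounds(diffs, lower_bound=-math.inf, upper_bound=math.inf):
    """Return the differences that fall outside [lower_bound, upper_bound]."""
    return [d for d in diffs if not (lower_bound <= d <= upper_bound)]


# The three _diff values displayed in the scores table.
diffs = [-0.086667, -0.051582, -0.040403]

violations = out_of_bounds(diffs, lower_bound=-0.1)
passed = not violations
print("passed" if passed else f"failed: {violations}")

# A stricter (hypothetical) bound would flag the largest displayed disparity.
strict_violations = out_of_bounds(diffs, lower_bound=-0.05)
```

Under this assumption, none of the displayed differences falls below -0.1, whereas a stricter bound such as -0.05 would flag the two largest disparities.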


Total running time of the script: ( 0 minutes 6.885 seconds)
