Model Error Analysis

This notebook provides an overview of how to use and understand the model error analysis check.

What is Model Error Analysis?

Evaluating the model’s overall performance metrics gives a good high-level overview and can be useful for tracking model progress during training or for comparing models. However, when it’s time to fully evaluate whether a model is fit for production, or when you want a deeper understanding of your model’s performance in order to improve it or to be aware of its weaknesses, it’s recommended to look more closely at how the model performs on various segments of the data. The model error analysis check searches for data segments in which the model error differs significantly from the model error of the dataset as a whole.

Algorithm:

  1. Computes the per-sample loss (log-loss for classification, MSE for regression).

  2. Trains a regression model to predict the error of the user’s model, based on the input features.

  3. Repeats step 2 several times with various tree parameters and random states to ensure that the most relevant partitions for model error are selected.

  4. Selects the features with the highest feature importance in the error regression model and plots the distribution of the error against the feature values.

The check results are shown only if the error regression model predicts the error well enough, that is, if its R-squared score exceeds a threshold defined by the min_error_model_score parameter (0.5 by default). The resulting plots show the distribution of the error for the features that are most effective at segmenting the error into high and low values, with no need to select segmentation features manually.
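The steps above can be illustrated with a minimal scikit-learn sketch. This is not the deepchecks implementation: the synthetic data, the hypothetical feature names, the choice of DecisionTreeRegressor as the error model, the small parameter grid, and the helper names per_sample_log_loss and fit_error_model are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the feature names are hypothetical.
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Stand-in for the user's model whose errors we want to analyse.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def per_sample_log_loss(model, X, y, eps=1e-15):
    """Step 1: log-loss of each individual sample (not averaged).

    Assumes the labels are the integers 0..k-1, matching model.classes_.
    """
    proba = np.clip(model.predict_proba(X), eps, 1 - eps)
    true_class_proba = proba[np.arange(len(y)), y]
    return -np.log(true_class_proba)


def fit_error_model(X_train, err_train, X_test, err_test):
    """Steps 2-3: fit several tree regressors on the error, keep the best by R-squared."""
    best_model, best_score = None, -np.inf
    for max_depth in (3, 5):
        for seed in (0, 1, 2):
            candidate = DecisionTreeRegressor(
                max_depth=max_depth, min_samples_leaf=20, random_state=seed
            )
            candidate.fit(X_train, err_train)
            score = candidate.score(X_test, err_test)  # R-squared on held-out errors
            if score > best_score:
                best_model, best_score = candidate, score
    return best_model, float(best_score)


err_train = per_sample_log_loss(model, X_train, y_train)
err_test = per_sample_log_loss(model, X_test, y_test)
error_model, r2 = fit_error_model(X_train, err_train, X_test, err_test)

# Step 4: rank features by their importance in the error model.
top_idx = np.argsort(error_model.feature_importances_)[::-1][:3]
top_features = [feature_names[i] for i in top_idx]

MIN_ERROR_MODEL_SCORE = 0.5  # mirrors the default threshold described above
if r2 >= MIN_ERROR_MODEL_SCORE:
    print(f"Error model R-squared = {r2:.2f}; top features: {top_features}")
else:
    print(f"Error model R-squared = {r2:.2f} is below the threshold; nothing would be shown")
```

In this sketch the error model is fitted on the training errors and scored on the test errors, so the R-squared value reflects how well the error pattern generalizes rather than how well it was memorized.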

Run the check

We will run the check on the adult dataset which can be downloaded from the UCI machine learning repository and is also available in deepchecks.tabular.datasets.

from deepchecks.tabular.datasets.classification import adult
from deepchecks.tabular.checks import ModelErrorAnalysis

train_ds, test_ds = adult.load_data(data_format='Dataset', as_train_test=True)
model = adult.load_fitted_model()

# We create the check with a slightly lower r squared threshold to ensure that the check can run on the example dataset.
check = ModelErrorAnalysis(min_error_model_score=0.3)
result = check.run(train_ds, test_ds, model)
result

Out:

/home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:180: UserWarning:

Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead

/home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:290: UserWarning:

Calculating permutation feature importance without time limit. Expected to finish in 27 seconds

Model Error Analysis

Find features that best split the data into segments of high and low model error.

Additional Outputs
The following graphs show the distribution of error for top features that are most useful for distinguishing high error samples from low error samples.


The check has found that the features ‘hours-per-week’, ‘age’ and ‘relationship’ are the most predictive of differences in the model error. We can further investigate the model performance by passing two of these columns to the Segment Performance check:

from deepchecks.tabular.checks import SegmentPerformance

SegmentPerformance(feature_1='age', feature_2='relationship').run(test_ds, model)

Out:

/home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:180: UserWarning:

Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead

Calculating permutation feature importance. Expected to finish in 42 seconds

Segment Performance

Display performance score segmented by 2 top (or given) features in a heatmap.

Additional Outputs


From this we learn that the model error is markedly higher for people with the “Husband” or “Other” relationship status, except in the lower age groups, where the error is lower.
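To make the idea of a two-feature performance grid concrete, here is a minimal pandas sketch. It is not how SegmentPerformance is implemented: the rows are toy values invented purely for illustration (they are not results from the adult dataset), and the age bin edges and the use of accuracy as the score are assumptions.

```python
import pandas as pd

# Toy rows, invented for illustration only.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 58, 63],
    "relationship": ["Own-child", "Husband", "Husband", "Wife",
                     "Husband", "Wife", "Husband", "Own-child"],
    "y_true": [0, 0, 1, 1, 1, 0, 1, 0],
    "y_pred": [0, 0, 0, 1, 1, 0, 0, 0],
})

# 1.0 where the prediction is correct, 0.0 otherwise.
df["correct"] = (df["y_true"] == df["y_pred"]).astype(float)

# Bin the numeric feature; the bin edges are an arbitrary choice.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["<=30", "31-50", ">50"])

# Mean accuracy per (relationship, age bin) cell; cells with no samples are NaN.
grid = df.pivot_table(
    index="relationship", columns="age_bin", values="correct",
    aggfunc="mean", observed=False,
)
print(grid)
```

Each cell of grid holds the score of one segment, which is the kind of table a heatmap would then color.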

Define a condition

We can define a condition that enforces that the relative difference between the weak and strong segments is not greater than a certain ratio, for example a ratio of 0.05:

check = check.add_condition_segments_performance_relative_difference_not_greater_than(0.05)
result = check.run(train_ds, test_ds, model)
result.show(show_additional_outputs=False)

Out:

/home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:180: UserWarning:

Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead

/home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:290: UserWarning:

Calculating permutation feature importance without time limit. Expected to finish in 29 seconds
Model Error Analysis
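For intuition about what a relative-difference condition compares, the following sketch shows one plausible definition. The exact formula used by the check is not stated here, so normalizing by the larger of the two scores is an assumption, and relative_difference and segments_pass are hypothetical helper names, not deepchecks functions.

```python
def relative_difference(score_a: float, score_b: float) -> float:
    """Absolute gap between two segment scores, relative to the larger one (assumed definition)."""
    larger = max(abs(score_a), abs(score_b))
    if larger == 0:
        return 0.0  # both scores are zero, so there is no gap to measure
    return abs(score_a - score_b) / larger


def segments_pass(score_weak: float, score_strong: float, max_ratio: float = 0.05) -> bool:
    """True when the gap between the weak and strong segments stays within max_ratio."""
    return relative_difference(score_weak, score_strong) <= max_ratio


# Illustrative scores only: a gap of 0.10 on a score of 0.90 is about 11%, so it fails at 0.05.
print(segments_pass(0.80, 0.90))
# A gap of 0.02 on a score of 0.90 is about 2%, so it passes at 0.05.
print(segments_pass(0.88, 0.90))
```

Under this definition, a threshold of 0.05 tolerates a weak segment scoring up to 5% below the strong one.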


Total running time of the script: (1 minute 14.399 seconds)
