
Model Evaluation Suite Quickstart#

The deepchecks model evaluation suite is relevant any time you wish to evaluate your model. For example:

Thorough analysis of the model’s performance before deploying it.
Evaluation of a proposed model during the model selection and optimization stage.
Checking the model’s performance on a new batch of data (with or without comparison to previous data batches).

Here we’ll build a regression model using the wine quality dataset (deepchecks.tabular.datasets.regression.wine_quality), to demonstrate how you can run the suite with only a few simple lines of code, and see which kind of insights it can find.

# Before we start, if you don't have deepchecks installed yet, run:
import sys
!{sys.executable} -m pip install deepchecks -U --quiet

# or install using pip from your python environment

Prepare Data and Model#

Load Data#

from deepchecks.tabular.datasets.regression import wine_quality

data = wine_quality.load_data(data_format='Dataframe', as_train_test=False)
data.head(2)

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.0	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.0	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5

Split Data and Train a Simple Model#

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], data['quality'], test_size=0.2, random_state=42)
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)

GradientBoostingRegressor()

Run Deepchecks for Model Evaluation#

Create a Dataset Object#

Create a deepchecks Dataset, including the relevant metadata (label, date, index, etc.). Check out deepchecks.tabular.Dataset to see all the column types and attributes that can be declared.

from deepchecks.tabular import Dataset

# Categorical features can be inferred heuristically, but we
# recommend stating them explicitly to avoid misclassification.

# Metadata attributes are optional. Some checks will run only if specific attributes are declared.

train_ds = Dataset(X_train, label=y_train, cat_features=[])
test_ds = Dataset(X_test, label=y_test, cat_features=[])

Run the Deepchecks Suite#

Evaluate your model with the deepchecks.tabular.suites.model_evaluation suite. It runs on two datasets and a model, so you can use it to compare the model’s performance between any two batches of data (e.g. train data, test data, or a new batch of data that recently arrived).

See the “when you should use” guide for more information about the existing suites and when to use each of them.

from deepchecks.tabular.suites import model_evaluation

evaluation_suite = model_evaluation()
suite_result = evaluation_suite.run(train_ds, test_ds, gbr)
# Note: the result can be saved as html using suite_result.save_as_html()
# or exported to json using suite_result.to_json()
suite_result.show()


Model Evaluation Suite

Didn’t Pass

Status	Check	Condition	More Info
✖	Train Test Performance	Train-Test scores relative degradation is less than 0.1	3 scores failed. Found max degradation of 27.91% for metric R2
✖	Regression Error Distribution - Test Dataset	Systematic error ratio lower than 0.01	Found systematic error to rmse ratio of 0.05
!	Weak Segments Performance - Test Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	Found a segment with neg rmse score of -1.074 in comparison to an average score of -0.602 in sampled data.
✓	Regression Error Distribution - Test Dataset	Kurtosis value higher than -0.1	Found kurtosis value of 0.47079

Train Test Performance

Conditions Summary

Status	Condition	More Info
✖	Train-Test scores relative degradation is less than 0.1	3 scores failed. Found max degradation of 27.91% for metric R2

Regression Error Distribution - Test Dataset

Conditions Summary

Status	Condition	More Info
✖	Systematic error ratio lower than 0.01	Found systematic error to rmse ratio of 0.05
✓	Kurtosis value higher than -0.1	Found kurtosis value of 0.47079

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	predicted quality	quality Prediction Difference
1449	7.20	0.38	0.31	2.00	0.06	15.00	29.00	0.99	3.23	0.76	11.30	8	6.28	1.72
1436	10.00	0.38	0.38	1.60	0.17	27.00	90.00	1.00	3.15	0.65	8.50	5	3.49	1.51
1269	5.50	0.49	0.03	1.80	0.04	28.00	87.00	0.99	3.50	0.82	14.00	8	6.54	1.46

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	predicted quality	quality Prediction Difference
1505	6.70	0.76	0.02	1.80	0.08	6.00	12.00	1.00	3.55	0.63	9.95	3	5.46	-2.46
462	11.00	0.26	0.68	2.55	0.09	10.00	25.00	1.00	3.18	0.61	11.80	5	6.76	-1.76
1484	6.80	0.91	0.06	2.00	0.06	4.00	11.00	1.00	3.53	0.64	10.90	4	5.62	-1.62

Weak Segments Performance - Test Dataset

Conditions Summary

Status	Condition	More Info
!	The relative performance of weakest segment is greater than 80% of average model performance.	Found a segment with neg rmse score of -1.074 in comparison to an average score of -0.602 in sampled data.

Passed

Status	Check	Condition	More Info
✓	Prediction Drift	Prediction drift score < 0.15	Found model prediction Kolmogorov-Smirnov drift score of 0.05
✓	Simple Model Comparison	Model performance gain over simple model is greater than 10%	All metrics passed, metric's gain: {'Neg RMSE': '25.71%'}
✓	Weak Segments Performance - Train Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	Found a segment with neg rmse score of -0.514 in comparison to an average score of -0.499 in sampled data.
✓	Regression Error Distribution - Train Dataset	Kurtosis value higher than -0.1	Found kurtosis value of 0.55157
✓	Regression Error Distribution - Train Dataset	Systematic error ratio lower than 0.01	Found systematic error to rmse ratio of 9.33E-16
✓	Boosting Overfit	Test score over iterations is less than 5% from the best score	Found score decline of 0%
✓	Model Inference Time - Train Dataset	Average model inference time for one sample is less than 0.001	Found average inference time (seconds): 2.18e-06
✓	Model Inference Time - Test Dataset	Average model inference time for one sample is less than 0.001	Found average inference time (seconds): 4.14e-06
✓	Unused Features - Train Dataset	Number of high variance unused features is less or equal to 5	Found 0 high variance unused features
✓	Unused Features - Test Dataset	Number of high variance unused features is less or equal to 5	Found 0 high variance unused features

Prediction Drift

Conditions Summary

Status	Condition	More Info
✓	Prediction drift score < 0.15	Found model prediction Kolmogorov-Smirnov drift score of 0.05

Simple Model Comparison

Conditions Summary

Status	Condition	More Info
✓	Model performance gain over simple model is greater than 10%	All metrics passed, metric's gain: {'Neg RMSE': '25.71%'}

Weak Segments Performance - Train Dataset

Conditions Summary

Status	Condition	More Info
✓	The relative performance of weakest segment is greater than 80% of average model performance.	Found a segment with neg rmse score of -0.514 in comparison to an average score of -0.499 in sampled data.

Regression Error Distribution - Train Dataset

Conditions Summary

Status	Condition	More Info
✓	Kurtosis value higher than -0.1	Found kurtosis value of 0.55157
✓	Systematic error ratio lower than 0.01	Found systematic error to rmse ratio of 9.33E-16

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	predicted quality	quality Prediction Difference
7	7.30	0.65	0.00	1.20	0.07	15.00	21.00	0.99	3.39	0.47	10.00	7	5.37	1.63
8	7.80	0.58	0.02	2.00	0.07	9.00	18.00	1.00	3.36	0.57	9.50	7	5.44	1.56
1403	7.20	0.33	0.33	1.70	0.06	3.00	13.00	1.00	3.23	1.10	10.00	8	6.44	1.56

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	predicted quality	quality Prediction Difference
1469	7.30	0.98	0.05	2.10	0.06	20.00	49.00	1.00	3.31	0.55	9.70	3	5.02	-2.02
1478	7.10	0.88	0.05	5.70	0.08	3.00	14.00	1.00	3.40	0.52	10.20	3	4.58	-1.58
1423	6.40	0.53	0.09	3.90	0.12	14.00	31.00	1.00	3.50	0.67	11.00	4	5.57	-1.57

Boosting Overfit

Conditions Summary

Status	Condition	More Info
✓	Test score over iterations is less than 5% from the best score	Found score decline of 0%

Model Inference Time - Train Dataset

Conditions Summary

Status	Condition	More Info
✓	Average model inference time for one sample is less than 0.001	Found average inference time (seconds): 2.18e-06

Model Inference Time - Test Dataset

Conditions Summary

Status	Condition	More Info
✓	Average model inference time for one sample is less than 0.001	Found average inference time (seconds): 4.14e-06

Check	Reason
Roc Report - Train Dataset	Check is irrelevant for regression tasks
Roc Report - Test Dataset	Check is irrelevant for regression tasks
Confusion Matrix Report - Train Dataset	Check is irrelevant for regression tasks
Confusion Matrix Report - Test Dataset	Check is irrelevant for regression tasks
Calibration Score - Train Dataset	Check is irrelevant for regression tasks
Calibration Score - Test Dataset	Check is irrelevant for regression tasks

Analyzing the results#

The results showcase a number of interesting insights. First, let’s inspect the “Didn’t Pass” section.

The Train Test Performance check result implies that the model overfitted the training data.
The Regression Error Distribution (test set) check result shows that the model has a small positive bias (a systematic error to RMSE ratio of 0.05).
The Weak Segments Performance (test set) check result visualizes specific sub-spaces on which the model performs poorly. Examples of such sub-spaces are wines with low total sulfur dioxide and wines with a high alcohol percentage.
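The two test-set findings above come down to simple arithmetic on prediction errors and scores. The sketch below is illustrative only: `systematic_error_ratio` and `weakest_segment_ok` are hypothetical helper names, not deepchecks functions. The way the 80% rule is applied to negative scores (scaling the average by 1 + 0.2) is an assumption, chosen because it is consistent with the statuses reported above.

```python
import math


def systematic_error_ratio(y_true, y_pred):
    """Ratio of the absolute mean error to the RMSE (0 means no systematic bias)."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_error = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return abs(mean_error) / rmse


def weakest_segment_ok(segment_score, average_score, max_ratio_change=0.2):
    """Assumed rule: the weakest segment may be at most 20% worse than average."""
    if average_score >= 0:
        return segment_score > (1 - max_ratio_change) * average_score
    # For negative scores such as neg RMSE, "20% worse" means 20% more negative.
    return segment_score > (1 + max_ratio_change) * average_score


# Toy labels and predictions (made up for illustration, not the wine data).
y_true = [5, 6, 7, 5]
biased_pred = [5.5, 6.5, 7.5, 5.5]    # every prediction is 0.5 too high
unbiased_pred = [5.5, 5.5, 7.5, 4.5]  # errors cancel out on average

biased_ratio = systematic_error_ratio(y_true, biased_pred)
unbiased_ratio = systematic_error_ratio(y_true, unbiased_pred)

# Scores taken from the suite output above.
test_segment_ok = weakest_segment_ok(-1.074, -0.602)
train_segment_ok = weakest_segment_ok(-0.514, -0.499)

print(biased_ratio, unbiased_ratio, test_segment_ok, train_segment_ok)
```

Under these assumptions, the test-set segment fails the rule while the train-set segment passes, which matches the warning and pass statuses in the report.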

Next, let’s examine the “Passed” section.

The Simple Model Comparison check result states that the model performs better than naive baseline models. The opposite result could indicate a problem with the model or with the data it was trained on.
The Boosting Overfit and Unused Features check results imply that the model has a well-calibrated boosting stopping rule and that it makes good use of the different data features.
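To make the baseline comparison concrete, here is a rough sketch of a gain calculation. It assumes that the naive baseline always predicts the mean of the training labels, and that gain is the share of the gap to a perfect score (0 for negative RMSE) that the model closes. Both are assumptions made for illustration, and `neg_rmse` and `gain_over_baseline` are hypothetical helpers rather than deepchecks API.

```python
import math


def neg_rmse(y_true, y_pred):
    """Negative root mean squared error (higher is better, 0 is perfect)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return -math.sqrt(sum(squared) / len(squared))


def gain_over_baseline(model_score, baseline_score, perfect_score=0.0):
    """Fraction of the baseline-to-perfect gap that the model closes."""
    return (model_score - baseline_score) / (perfect_score - baseline_score)


# Toy data, made up for illustration (not the wine dataset).
y_train = [5, 6, 5, 7, 6]
y_test = [5, 6, 7]
model_pred = [5.4, 6.1, 6.5]

# Naive baseline: always predict the training mean.
train_mean = sum(y_train) / len(y_train)
baseline_pred = [train_mean] * len(y_test)

model_score = neg_rmse(y_test, model_pred)
baseline_score = neg_rmse(y_test, baseline_pred)
gain = gain_over_baseline(model_score, baseline_score)

print(f"model={model_score:.3f} baseline={baseline_score:.3f} gain={gain:.1%}")
```

A gain above the condition’s 10% threshold means the model clearly beats the naive baseline. A gain near zero or below would suggest the model has learned little beyond the label average.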

Let’s try and fix the overfitting issue found in the model.

Fix the Model and Re-run a Single Check#

from deepchecks.tabular.checks import TrainTestPerformance

gbr = GradientBoostingRegressor(n_estimators=20)
gbr.fit(X_train, y_train)
# Initialize the check and add an optional condition
check = TrainTestPerformance().add_condition_train_test_relative_degradation_less_than(0.3)
result = check.run(train_ds, test_ds, gbr)
result.show()

Train Test Performance

Conditions Summary

Status	Condition	More Info
✓	Train-Test scores relative degradation is less than 0.3	Found max degradation of 13.99% for metric R2

We mitigated the overfitting to some extent. Additional model tuning is required to overcome the other issues discussed above. For now, we will update the relevant condition and remove the relevant check from the suite.
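The degradation figure in the condition can be read as the drop from the train score to the test score, relative to the train score. The minimal sketch below assumes that definition. `relative_degradation` is a hypothetical helper, and the R2 values are invented for illustration; they are not the scores from this run.

```python
def relative_degradation(train_score, test_score):
    """Drop from train to test score, as a fraction of the train score."""
    return (train_score - test_score) / abs(train_score)


# Invented R2 scores, for illustration only.
train_r2, test_r2 = 0.50, 0.40
degradation = relative_degradation(train_r2, test_r2)

passes_strict = degradation < 0.1   # threshold of the suite's default condition
passes_relaxed = degradation < 0.3  # relaxed threshold used in the check above

print(f"degradation={degradation:.2%} strict={passes_strict} relaxed={passes_relaxed}")
```

Relaxing the threshold changes only whether the condition passes. It does not change the model, so the size of the gap is still worth watching.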

Updating an Existing Suite#

To create our own suite, we can start with an empty suite and add checks and conditions to it (see Create a Custom Suite), or we can start with one of the default suites and update it as demonstrated in this section.

Let’s inspect the structure of our model evaluation suite:

evaluation_suite

Model Evaluation Suite: [
    0: TrainTestPerformance
            Conditions:
                    0: Train-Test scores relative degradation is less than 0.1
    1: RocReport
            Conditions:
                    0: AUC score for all the classes is greater than 0.7
    2: ConfusionMatrixReport
    3: PredictionDrift
            Conditions:
                    0: Prediction drift score < 0.15
    4: SimpleModelComparison
            Conditions:
                    0: Model performance gain over simple model is greater than 10%
    5: WeakSegmentsPerformance(n_to_show=5)
            Conditions:
                    0: The relative performance of weakest segment is greater than 80% of average model performance.
    6: CalibrationScore
    7: RegressionErrorDistribution
            Conditions:
                    0: Kurtosis value higher than -0.1
                    1: Systematic error ratio lower than 0.01
    8: UnusedFeatures
            Conditions:
                    0: Number of high variance unused features is less or equal to 5
    9: BoostingOverfit
            Conditions:
                    0: Test score over iterations is less than 5% from the best score
    10: ModelInferenceTime
            Conditions:
                    0: Average model inference time for one sample is less than 0.001
]

Next, we will update the Train Test Performance condition (check index 0) and remove the Regression Error Distribution check (index 7), which holds the systematic error condition:

evaluation_suite[0].clean_conditions()
evaluation_suite[0].add_condition_train_test_relative_degradation_less_than(0.3)
evaluation_suite = evaluation_suite.remove(7)

Re-run the suite using:

result = evaluation_suite.run(train_ds, test_ds, gbr)
result.passed(fail_if_warning=False)


True
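One way to read the `fail_if_warning=False` argument is that conditions ending in a warning (`!`) do not count as failures, while failed conditions (`✖`) still do. The sketch below illustrates that reading with a hypothetical `suite_passed` helper and plain status strings. It is an assumption about the semantics, not the deepchecks implementation.

```python
def suite_passed(statuses, fail_if_warning=True):
    """Return True when no condition failed (and, optionally, none warned)."""
    for status in statuses:
        if status == "fail":
            return False
        if status == "warn" and fail_if_warning:
            return False
    return True


# Hypothetical condition statuses: everything passes except one warning.
statuses = ["pass", "pass", "warn", "pass"]

lenient = suite_passed(statuses, fail_if_warning=False)
strict = suite_passed(statuses, fail_if_warning=True)

print(lenient, strict)
```

Under this reading, a suite with a remaining warning is reported as passed only in the lenient mode.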

For more info about working with conditions, see the detailed Configure Check Conditions guide.
