Model Evaluation Suite Quickstart#

The deepchecks model evaluation suite is relevant any time you wish to evaluate your model. For example:

Thorough analysis of the model’s performance before deploying it.
Evaluation of a proposed model during the model selection and optimization stage.
Checking the model’s performance on a new batch of data (with or without comparison to previous data batches).

Here we’ll build a regression model using the wine quality dataset (deepchecks.tabular.datasets.regression.wine_quality), to demonstrate how you can run the suite with only a few simple lines of code, and see which kind of insights it can find.

# Before we start, if you don't have deepchecks installed yet, run:
import sys
!{sys.executable} -m pip install deepchecks -U --quiet

# or install using pip from your python environment

Prepare Data and Model#

Load Data#

from deepchecks.tabular.datasets.regression import wine_quality

data = wine_quality.load_data(data_format='Dataframe', as_train_test=False)
data.head(2)

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.0	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.0	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5

Split Data and Train a Simple Model#

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], data['quality'], test_size=0.2, random_state=42)
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)

GradientBoostingRegressor()

Run Deepchecks for Model Evaluation#

Create a Dataset Object#

Create a deepchecks Dataset, including the relevant metadata (label, date, index, etc.). Check out deepchecks.tabular.Dataset to see all the column types and attributes that can be declared.

from deepchecks.tabular import Dataset

# Categorical features can be heuristically inferred, however we
# recommend to state them explicitly to avoid misclassification.

# Metadata attributes are optional. Some checks will run only if specific attributes are declared.

train_ds = Dataset(X_train, label=y_train, cat_features=[])
test_ds = Dataset(X_test, label=y_test, cat_features=[])

Run the Deepchecks Suite#

Validate your data with the deepchecks.tabular.suites.model_evaluation suite. It runs on two datasets and a model, so you can use it to compare the performance of the model between any two batches of data (e.g. train data, test data, a new batch of data that recently arrived)

Check out the when you should use for some more info about the existing suites and when to use them.

from deepchecks.tabular.suites import model_evaluation

evaluation_suite = model_evaluation()
suite_result = evaluation_suite.run(train_ds, test_ds, gbr)
# Note: the result can be saved as html using suite_result.save_as_html()
# or exported to json using suite_result.to_json()
suite_result.show()

Model Evaluation Suite:
|           | 0/11 [Time: 00:00]
Model Evaluation Suite:
|#          | 1/11 [Time: 00:00, Check=Train Test Performance]
Model Evaluation Suite:
|#####      | 5/11 [Time: 00:00, Check=Simple Model Comparison]
Model Evaluation Suite:
|########   | 8/11 [Time: 00:03, Check=Regression Error Distribution]
Model Evaluation Suite:
|########## | 10/11 [Time: 00:04, Check=Boosting Overfit]

Model Evaluation Suite

Analyzing the results#

The result showcase a number of interesting insights, first let’s inspect the “Didn’t Pass” section.

Train Test Performance check result implies that the model overfitted the training data.
Regression Systematic Error (test set) check result demonstrate the model small positive bias.
Weak Segments Performance (test set) check result visualize some specific sub-spaces on which the model performs poorly. Examples for those sub-spaces are wines with low total sulfur dioxide and wines with high alcohol percentage.

Next, let’s examine the “Passed” section.

Simple Model Comparison check result states that the model performs better than naive baseline models, an opposite result could indicate a problem with the model or the data it was trained on.
Boosting Overfit check and the Unused Features check results implies that the model has a well calibrating boosting stopping rule and that it make good use on the different data features.

Let’s try and fix the overfitting issue found in the model.

Fix the Model and Re-run a Single Check#

from deepchecks.tabular.checks import TrainTestPerformance

gbr = GradientBoostingRegressor(n_estimators=20)
gbr.fit(X_train, y_train)
# Initialize the check and add an optional condition
check = TrainTestPerformance().add_condition_train_test_relative_degradation_less_than(0.3)
result = check.run(train_ds, test_ds, gbr)
result.show()

Train Test Performance

We mitigated the overfitting to some extent. Additional model tuning is required to overcome other issues discussed above. For now, we will update and remove the relevant conditions from the suite.

Updating an Existing Suite#

To create our own suite, we can start with an empty suite and add checks and condition to it (see Create a Custom Suite), or we can start with one of the default suites and update it as demonstrated in this section.

let’s inspect our model evaluation suite’s structure

evaluation_suite

Model Evaluation Suite: [
    0: TrainTestPerformance
            Conditions:
                    0: Train-Test scores relative degradation is less than 0.1
    1: RocReport
            Conditions:
                    0: AUC score for all the classes is greater than 0.7
    2: ConfusionMatrixReport
    3: PredictionDrift
            Conditions:
                    0: categorical drift score < 0.15 and numerical drift score < 0.15
    4: SimpleModelComparison
            Conditions:
                    0: Model performance gain over simple model is greater than 10%
    5: WeakSegmentsPerformance(n_to_show=5)
            Conditions:
                    0: The relative performance of weakest segment is greater than 80% of average model performance.
    6: CalibrationScore
    7: RegressionErrorDistribution
            Conditions:
                    0: Kurtosis value higher than -0.1
                    1: Systematic error ratio lower than 0.01
    8: UnusedFeatures
            Conditions:
                    0: Number of high variance unused features is less or equal to 5
    9: BoostingOverfit
            Conditions:
                    0: Test score over iterations is less than 5% from the best score
    10: ModelInferenceTime
            Conditions:
                    0: Average model inference time for one sample is less than 0.001
]

Next, we will update the Train Test Performance condition and remove the Regression Systematic Error check:

evaluation_suite[0].clean_conditions()
evaluation_suite[0].add_condition_train_test_relative_degradation_less_than(0.3)
evaluation_suite = evaluation_suite.remove(7)

Re-run the suite using:

result = evaluation_suite.run(train_ds, test_ds, gbr)
result.passed(fail_if_warning=False)

Model Evaluation Suite:
|          | 0/10 [Time: 00:00]
Model Evaluation Suite:
|#         | 1/10 [Time: 00:00, Check=Train Test Performance]
Model Evaluation Suite:
|#####     | 5/10 [Time: 00:00, Check=Simple Model Comparison]
Model Evaluation Suite:
|########  | 8/10 [Time: 00:03, Check=Unused Features]
Model Evaluation Suite:
|##########| 10/10 [Time: 00:04, Check=Model Inference Time]

True

For more info about working with conditions, see the detailed Configure Check Conditions guide.

Total running time of the script: ( 0 minutes 10.161 seconds)

Gallery generated by Sphinx-Gallery

Full Suite Quickstart

Data Integrity Suite Quickstart