Create a Custom Suite#
A suite is a list of checks that will run one after the other, and its results will be displayed together.
To customize a suite, we can either:
Create new custom suites, by choosing the checks (and the optional conditions) that we want the suite to have.
Modify a built-in suite by adding and/or removing checks and conditions, to adapt it to our needs.
Create a New Suite#
Let's say we want to create our own custom suite, consisting mainly of various performance checks,
including PerformanceReport(), TrainTestDifferenceOverfit()
and several more.
To understand which checks are implemented and can be included, we suggest printing the built-in suites to see which checks they contain.
from sklearn.metrics import make_scorer, precision_score, recall_score

from deepchecks.tabular import Suite
# importing all existing checks for demonstration simplicity
from deepchecks.tabular.checks import *

# The Suite's first argument is its name, followed by the check objects.
# Some checks accept arguments when initialized (all check arguments have default values).
# Each check can have one or more optional conditions.
# Multiple conditions can be applied sequentially.
new_custom_suite = Suite(
    'Simple Suite For Model Performance',
    ModelInfo(),
    # use custom scorers for the performance report:
    PerformanceReport()
        .add_condition_train_test_relative_degradation_not_greater_than(threshold=0.15)
        .add_condition_test_performance_not_less_than(0.8),
    ConfusionMatrixReport(),
    SimpleModelComparison(
        simple_model_type='constant',
        alternative_scorers={
            'Recall (Multiclass)': make_scorer(recall_score, average=None),
            'Precision (Multiclass)': make_scorer(precision_score, average=None),
        },
    ).add_condition_gain_not_less_than(0.3),
)
# Let's see the suite:
new_custom_suite
Out:
Simple Suite For Model Performance: [
0: ModelInfo
1: PerformanceReport
Conditions:
0: Train-Test scores relative degradation is not greater than 0.15
1: Scores are not less than 0.8
2: ConfusionMatrixReport
3: SimpleModelComparison
Conditions:
0: Model performance gain over simple model is not less than 30%
]
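The alternative_scorers mapping above simply wraps standard sklearn scorers. As a quick aside, here is how a per-class scorer built with make_scorer behaves on its own (a minimal sklearn-only sketch, independent of deepchecks):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# average=None makes recall_score return one value per class
# instead of a single aggregated number
per_class_recall = make_scorer(recall_score, average=None)
scores = per_class_recall(clf, X, y)
print(scores)  # one recall value for each of the three iris classes
```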
TIP: auto-complete may not work from inside a new suite definition, so if you want to use auto-complete to see the arguments a check receives or the built-in conditions it has, try doing it outside of the suite's initialization.
For example, to see a check's built-in conditions, type in a new cell: ``NameOfDesiredCheck().add_condition_`` and then check the auto-complete suggestions (using Shift + Tab), to discover the built-in conditions.
Additional Notes about Conditions in a Suite#
Checks in the built-in suites come with pre-defined conditions; when building your custom suite, you choose which conditions to add.
Most check classes have built-in methods for adding conditions, following the naming convention
add_condition_...
, which attach condition logic that parses the check's results. Each check instance can have several conditions or none, and each condition is evaluated separately.
The pass (✓) / fail (✖) / insight (!) status of the conditions, along with the condition’s name and extra info will be displayed in the suite’s Conditions Summary.
Most conditions have configurable arguments that can be passed to the condition while adding it.
For more info about conditions, check out Configure a Condition.
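To make the add_condition_* pattern concrete, here is a toy stand-in (purely illustrative, not deepchecks' actual implementation) showing how chained condition methods accumulate named predicates that are later evaluated against a check's result:

```python
class ToyCheck:
    """Toy stand-in for the add_condition_* fluent pattern (not deepchecks code)."""

    def __init__(self):
        self.conditions = []

    def add_condition_score_not_less_than(self, threshold=0.5):
        # each condition is a (name, predicate) pair evaluated on the check's result
        self.conditions.append(
            (f'Score is not less than {threshold}', lambda result: result >= threshold))
        return self  # returning self enables chaining, as in deepchecks

    def evaluate_conditions(self, result):
        return {name: pred(result) for name, pred in self.conditions}


check = ToyCheck().add_condition_score_not_less_than(0.8).add_condition_score_not_less_than(0.3)
print(check.evaluate_conditions(0.6))
# → {'Score is not less than 0.8': False, 'Score is not less than 0.3': True}
```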
Run the Suite#
This is done simply by calling the suite's run() method.
To see it in action, we'll need datasets and a model, so let's quickly load a dataset and train a simple model for the sake of this demo.
Load Datasets and Train a Simple Model#
# General imports
import numpy as np
import pandas as pd

np.random.seed(22)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from deepchecks.tabular.datasets.classification import iris

# Load pre-split Datasets
train_dataset, test_dataset = iris.load_data(as_train_test=True)
label_col = 'target'

# Train Model
rf_clf = RandomForestClassifier()
rf_clf.fit(train_dataset.data[train_dataset.features],
           train_dataset.data[train_dataset.label_name]);
Out:
RandomForestClassifier()
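The demo imports train_test_split but never calls it, since iris.load_data already returns pre-split deepchecks Datasets. With raw arrays, the split would look like this (a self-contained sklearn-only sketch):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# hold out a third of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f'test accuracy: {clf.score(X_test, y_test):.2f}')
```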
Run Suite#
new_custom_suite.run(model=rf_clf, train_dataset=train_dataset, test_dataset=test_dataset)
Out:
Simple Suite For Model Performance: 0%| | 0/4 [00:00<?, ? Check/s]
Simple Suite For Model Performance: 0%| | 0/4 [00:00<?, ? Check/s, Check=Model Info]
Simple Suite For Model Performance: 25%|# | 1/4 [00:00<00:00, 53.76 Check/s, Check=Performance Report]
Simple Suite For Model Performance: 50%|## | 2/4 [00:00<00:00, 10.04 Check/s, Check=Performance Report]
Simple Suite For Model Performance: 50%|## | 2/4 [00:00<00:00, 10.04 Check/s, Check=Confusion Matrix Report]
Simple Suite For Model Performance: 75%|### | 3/4 [00:00<00:00, 10.04 Check/s, Check=Simple Model Comparison]
Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
Simple Suite For Model Performance: 100%|####| 4/4 [00:00<00:00, 11.93 Check/s, Check=Simple Model Comparison]
Modify an Existing Suite#
from deepchecks.tabular.suites import train_test_leakage
customized_suite = train_test_leakage()
# let's check what it has:
customized_suite
Out:
Train Test Leakage Suite: [
0: DateTrainTestLeakageDuplicates
Conditions:
0: Date leakage ratio is not greater than 0%
1: DateTrainTestLeakageOverlap
Conditions:
0: Date leakage ratio is not greater than 0%
2: SingleFeatureContributionTrainTest(ppscore_params={})
Conditions:
0: Train-Test features' Predictive Power Score difference is not greater than 0.2
1: Train features' Predictive Power Score is not greater than 0.7
3: TrainTestSamplesMix
Conditions:
0: Percentage of test data samples that appear in train data not greater than 10%
4: IdentifierLeakage(ppscore_params={})
Conditions:
0: Identifier columns PPS is not greater than 0
5: IndexTrainTestLeakage
Conditions:
0: Ratio of leaking indices is not greater than 0%
]
# and modify it by removing a check by index:
customized_suite.remove(1)
Out:
Train Test Leakage Suite: [
0: DateTrainTestLeakageDuplicates
Conditions:
0: Date leakage ratio is not greater than 0%
2: SingleFeatureContributionTrainTest(ppscore_params={})
Conditions:
0: Train-Test features' Predictive Power Score difference is not greater than 0.2
1: Train features' Predictive Power Score is not greater than 0.7
3: TrainTestSamplesMix
Conditions:
0: Percentage of test data samples that appear in train data not greater than 10%
4: IdentifierLeakage(ppscore_params={})
Conditions:
0: Identifier columns PPS is not greater than 0
5: IndexTrainTestLeakage
Conditions:
0: Ratio of leaking indices is not greater than 0%
]
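Notice that after remove(1) the remaining checks keep their original indices (0, 2, 3, ...) instead of being renumbered; the suite behaves like a mapping keyed by each check's original position. A toy sketch of that behavior (illustrative only, not deepchecks code):

```python
from collections import OrderedDict

# checks keyed by their original position, as in the suite listing above
checks = OrderedDict(enumerate([
    'DateTrainTestLeakageDuplicates',
    'DateTrainTestLeakageOverlap',
    'SingleFeatureContributionTrainTest',
]))

del checks[1]  # analogous to customized_suite.remove(1)
print(list(checks.keys()))  # [0, 2] -- indices are preserved, not shifted
```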
from deepchecks.tabular.checks import UnusedFeatures
# and add a new check with a condition:
customized_suite.add(
UnusedFeatures().add_condition_number_of_high_variance_unused_features_not_greater_than())
Out:
Train Test Leakage Suite: [
0: DateTrainTestLeakageDuplicates
Conditions:
0: Date leakage ratio is not greater than 0%
2: SingleFeatureContributionTrainTest(ppscore_params={})
Conditions:
0: Train-Test features' Predictive Power Score difference is not greater than 0.2
1: Train features' Predictive Power Score is not greater than 0.7
3: TrainTestSamplesMix
Conditions:
0: Percentage of test data samples that appear in train data not greater than 10%
4: IdentifierLeakage(ppscore_params={})
Conditions:
0: Identifier columns PPS is not greater than 0
5: IndexTrainTestLeakage
Conditions:
0: Ratio of leaking indices is not greater than 0%
6: UnusedFeatures
Conditions:
0: Number of high variance unused features is not greater than 5
]
# Let's remove all conditions from the SingleFeatureContributionTrainTest check
# (index 2 in the listing above):
customized_suite[2].clean_conditions()
# and update the suite's name:
customized_suite.name = 'New Data Leakage Suite'
# and now we can run our modified suite:
customized_suite.run(train_dataset, test_dataset, rf_clf)
Out:
New Data Leakage Suite: 0%| | 0/6 [00:00<?, ? Check/s]
New Data Leakage Suite: 0%| | 0/6 [00:00<?, ? Check/s, Check=Date Train Test Leakage Duplicates]
New Data Leakage Suite: 17%|# | 1/6 [00:00<00:00, 6260.16 Check/s, Check=Single Feature Contribution Train Test]
New Data Leakage Suite: 33%|## | 2/6 [00:00<00:00, 30.32 Check/s, Check=Train Test Samples Mix]
New Data Leakage Suite: 50%|### | 3/6 [00:00<00:00, 38.50 Check/s, Check=Identifier Leakage]
New Data Leakage Suite: 67%|#### | 4/6 [00:00<00:00, 51.00 Check/s, Check=Index Train Test Leakage]
New Data Leakage Suite: 83%|##### | 5/6 [00:00<00:00, 63.62 Check/s, Check=Unused Features]
New Data Leakage Suite: 100%|######| 6/6 [00:00<00:00, 43.42 Check/s, Check=Unused Features]
Total running time of the script: ( 0 minutes 2.378 seconds)