Metrics Guide#

In this guide we’ll explain how to customize the metrics that deepchecks uses to validate and monitor your model performance. Controlling the metrics helps you shape the checks and suites according to the specifics of your use case.

Default Metrics#

All checks that evaluate model performance, such as SingleDatasetPerformance, come with default metrics.

The default metrics by task type are:

Tabular#

Binary classification:

  • Accuracy 'accuracy'

  • Precision 'precision'

  • Recall 'recall'

Multiclass classification averaged over the classes:

  • Accuracy 'accuracy'

  • Precision 'precision_macro'

  • Recall 'recall_macro'

Multiclass classification per class:

  • F1 'f1_per_class'

  • Precision 'precision_per_class'

  • Recall 'recall_per_class'

Regression:

  • Negative RMSE 'neg_rmse'

  • Negative MAE 'neg_mae'

  • R2 'r2'

Note

Deepchecks follows the convention that a greater metric value represents better performance. It is therefore recommended to use only metrics that follow this convention, for example Negative MAE instead of MAE.
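To make the convention concrete, here is a minimal, dependency-free sketch of why negating an error metric keeps "greater is better" intact. The helper functions and the sample values are illustrative assumptions, not part of Deepchecks; inside Deepchecks you would use the 'neg_mae' and 'neg_rmse' strings instead.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def neg_mae(y_true, y_pred):
    """Negated MAE: higher is better, matching the convention."""
    return -mae(y_true, y_pred)

def neg_rmse(y_true, y_pred):
    """Negated RMSE: higher is better, matching the convention."""
    return -rmse(y_true, y_pred)

# Hypothetical targets and two hypothetical sets of predictions
y_true = [3.0, 5.0, 2.0, 7.0]
candidates = {
    "good": [3.0, 5.0, 2.0, 6.0],  # off by 1 on a single sample
    "bad": [1.0, 7.0, 4.0, 9.0],   # off by 2 on every sample
}

# With negated errors, "pick the maximum" selects the better predictions
best_by_neg_mae = max(candidates, key=lambda name: neg_mae(y_true, candidates[name]))
best_by_neg_rmse = max(candidates, key=lambda name: neg_rmse(y_true, candidates[name]))
```

With the raw error metrics, the same `max` selection would pick the worse predictions, which is why the non-negated strings are not recommended.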

Vision#

Classification:

  • Precision 'precision_per_class'

  • Recall 'recall_per_class'

Object detection:

  • Mean average precision 'average_precision_per_class'

  • Mean average recall 'average_recall_per_class'

Running a Check with Default Metrics#

To run a check with the default metrics, run it without passing any value to the “scorers” parameter. We will demonstrate it using the ClassPerformance check:

from deepchecks.vision.checks import ClassPerformance
from deepchecks.vision.datasets.classification import mnist
mnist_model = mnist.load_model()
train_ds = mnist.load_dataset(train=True, object_type='VisionData')
test_ds = mnist.load_dataset(train=False, object_type='VisionData')
check = ClassPerformance()
result = check.run(train_ds, test_ds, mnist_model)

Alternative Metrics#

Sometimes the defaults don’t fit the specifics of the use case. If this is the case, you can pass a list of supported metric strings or a dict in the format {metric_name_string: metric} to the scorers parameter of the check or suite.

The metrics in the dict can be any of the following:

  • Strings from Deepchecks’ supported strings for both vision and tabular.

  • Ignite Metrics for vision. An Ignite Metric is a class with the methods: reset, compute, and update, that iterates over batches of data and aggregates the result.

  • Deepchecks Metrics for vision. These are metrics implemented by Deepchecks as custom Ignite Metrics. Customized Deepchecks Metrics, such as the object detection metric MeanIoU, are useful when, for example, custom confidence or IoU thresholds are needed. You can import them from deepchecks.vision.metrics.

  • Scikit-learn Scorers for both vision and tabular. A Scikit-learn Scorer is a function that accepts the parameters: (model, x, y_true), and returns a score with the convention that higher is better.

  • Your own implementation.
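The reset / update / compute life cycle mentioned for Ignite Metrics can be sketched without any dependencies. The class below is a toy stand-in written for illustration only: the name BatchAccuracy and the batch layout (predictions, labels) are assumptions, and a metric actually passed to a vision check would need to subclass ignite.metrics.Metric, which this sketch does not.

```python
class BatchAccuracy:
    """Toy metric that follows the reset / update / compute life cycle."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear the accumulated state before a new evaluation run
        self._correct = 0
        self._total = 0

    def update(self, output):
        # Each batch is assumed to be a (predictions, labels) pair
        y_pred, y_true = output
        self._correct += sum(int(p == t) for p, t in zip(y_pred, y_true))
        self._total += len(y_true)

    def compute(self):
        # Aggregate everything seen since the last reset
        if self._total == 0:
            raise ValueError("compute() called before any update()")
        return self._correct / self._total


# Hypothetical batches of (predicted labels, true labels)
batches = [
    ([0, 1, 1], [0, 1, 0]),
    ([2, 2], [2, 2]),
    ([1, 0, 0], [1, 0, 1]),
]

metric = BatchAccuracy()
for batch in batches:
    metric.update(batch)
accuracy = metric.compute()
```

Keeping only running counts in `update` means the metric never has to hold the whole dataset in memory, which is the point of the batch-wise design.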

Example for passing strings:

from deepchecks.tabular.checks import TrainTestPerformance
from deepchecks.tabular.datasets.classification import adult
train_ds, test_ds = adult.load_data(data_format='Dataset', as_train_test=True)
model = adult.load_fitted_model()

scorer = ['precision_per_class', 'recall_per_class', 'fnr_macro']
check = TrainTestPerformance(scorers=scorer)
result = check.run(train_ds, test_ds, model)

Example for passing Deepchecks metrics:

from deepchecks.vision.checks import SingleDatasetPerformance
from deepchecks.vision.metrics import MeanDice
from deepchecks.vision.datasets.segmentation.segmentation_coco import load_dataset, load_model

coco_dataset = load_dataset()
coco_model = load_model()
# Map a display name to a Deepchecks metric configured with a custom threshold
metric = {'mean_dice': MeanDice(threshold=0.9)}

check = SingleDatasetPerformance(scorers=metric)
result = check.run(coco_dataset, coco_model)

List of Supported Strings#

In addition to the strings listed below, all Sklearn scorer strings apply.

Regression#

String     | Metric | Comments
-----------|--------|---------
'neg_rmse' | negative root mean squared error | higher value represents better performance
'neg_mae'  | negative mean absolute error | higher value represents better performance
'rmse'     | root mean squared error | not recommended, see note.
'mae'      | mean absolute error | not recommended, see note.
'mse'      | mean squared error | not recommended, see note.
'r2'       | R2 score |

Classification#

Note

For classification tasks, Deepchecks requires an ordered list of all possible classes (it can also be inferred from the provided data and model). It is also recommended to supply the model’s output probabilities per class, since some metrics and checks do not work without them. See link for additional information.

String                 | Metric | Comments
-----------------------|--------|---------
'accuracy'             | classification accuracy | scikit-learn
'roc_auc'              | Area Under the Receiver Operating Characteristic Curve (ROC AUC) - binary | for multiclass averaging options see scikit-learn’s documentation
'roc_auc_per_class'    | Area Under the Receiver Operating Characteristic Curve (ROC AUC) - score per class | for multiclass averaging options see scikit-learn’s documentation
'f1'                   | F-1 - binary |
'f1_per_class'         | F-1 per class - no averaging |
'f1_macro'             | F-1 - macro averaging |
'f1_micro'             | F-1 - micro averaging |
'f1_weighted'          | F-1 - macro, weighted by support |
'precision'            | precision | suffixes apply as with 'f1'
'recall', 'sensitivity' | recall (sensitivity) | suffixes apply as with 'f1'
'fpr'                  | False Positive Rate - binary | suffixes apply as with 'f1'
'fnr'                  | False Negative Rate - binary | suffixes apply as with 'f1'
'tnr', 'specificity'   | True Negative Rate - binary | suffixes apply as with 'f1'
'roc_auc'              | AUC - binary |
'roc_auc_per_class'    | AUC per class - no averaging |
'roc_auc_ovr'          | AUC - One-vs-rest |
'roc_auc_ovo'          | AUC - One-vs-One |
'roc_auc_ovr_weighted' | AUC - One-vs-rest, weighted by support |
'roc_auc_ovo_weighted' | AUC - One-vs-One, weighted by support |
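The '_per_class', '_macro', '_micro' and '_weighted' suffixes differ only in how per-class results are combined. The sketch below works this out by hand for F-1 on a small, made-up set of labels. It uses the standard definitions of these averages and is not Deepchecks' own implementation; the helper names and the data are assumptions chosen for illustration.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F-1 expressed through counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels: two samples of class 1 are predicted as class 0
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]

classes = sorted(set(y_true) | set(y_pred))
tp, fp, fn = per_class_counts(y_true, y_pred)
support = Counter(y_true)  # number of true samples per class

# No averaging: one score per class
f1_per_class = {c: f1_from_counts(tp[c], fp[c], fn[c]) for c in classes}

# Macro: unweighted mean of the per-class scores
f1_macro = sum(f1_per_class.values()) / len(classes)

# Micro: pool the counts over all classes, then compute a single score
f1_micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))

# Weighted: mean of the per-class scores, weighted by class support
f1_weighted = sum(f1_per_class[c] * support[c] for c in classes) / len(y_true)
```

On this data the three averages all differ, because class 0 collects only false positives and class 1 only false negatives; macro averaging treats every class equally, while weighted averaging lets the larger classes dominate.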

Object Detection#

String                        | Metric | Comments
------------------------------|--------|---------
'average_precision_per_class' | average precision for object detection |
'average_precision_macro'     | average precision macro averaging |
'average_precision_weighted'  | average precision macro, weighted by support |
'average_recall_per_class'    | average recall for object detection | suffixes apply as with 'average_precision'

Custom Metrics#

You can also pass your own custom metric to relevant checks and suites.

For computer vision, custom metrics should support the Ignite Metric API.

For tabular data, custom metrics should support the sklearn scorer API. Multiclass classification scorers should assume that the labels are given in a multi-label format (a binary matrix). Binary classification scorers should assume that the labels are given as 0 and 1.
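As an illustration of the scorer call signature (model, x, y_true) for a binary task with 0/1 labels, here is a dependency-free sketch. ThresholdModel and specificity_scorer are hypothetical names invented for this example. Whether a bare callable like this is accepted directly by your Deepchecks version is an assumption to verify; wrapping a metric function with sklearn's make_scorer is the safer route.

```python
class ThresholdModel:
    """Hypothetical stand-in for a fitted binary classifier."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, X):
        # Predict 1 when the first feature reaches the threshold, else 0
        return [1 if row[0] >= self.threshold else 0 for row in X]


def specificity_scorer(model, X, y_true):
    """Scorer-style callable: takes (model, X, y_true), returns a float.

    Returns the true negative rate, so a higher value is better.
    """
    y_pred = model.predict(X)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tn / (tn + fp) if (tn + fp) else 0.0


# Hypothetical single-feature data with binary labels given as 0 and 1
X = [[0.1], [0.4], [0.6], [0.9], [0.3], [0.8]]
y = [0, 0, 0, 1, 0, 1]

model = ThresholdModel(threshold=0.5)
score = specificity_scorer(model, X, y)
```

The scorer receives the model rather than its predictions, so it decides itself whether to call `predict` or a probability method.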

Tabular Example#

from deepchecks.tabular.datasets.classification import adult
from deepchecks.tabular.suites import model_evaluation
from sklearn.metrics import cohen_kappa_score, fbeta_score, make_scorer

# F-beta with beta=0.2; average=None returns one score per label
f1_scorer = make_scorer(fbeta_score, labels=[0, 1], average=None, beta=0.2)
# Cohen's kappa needs no extra arguments
ck_scorer = make_scorer(cohen_kappa_score)
# The dict keys are the names under which the scores are reported
custom_scorers = {'f1': f1_scorer, 'cohen': ck_scorer}

train_ds, test_ds = adult.load_data(data_format='Dataset', as_train_test=True)
model = adult.load_fitted_model()
suite = model_evaluation(scorers=custom_scorers)
result = suite.run(train_ds, test_ds, model)

Vision Example#

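from ignite.metrics import Precision
from deepchecks.vision.checks import SingleDatasetPerformance
from deepchecks.vision.datasets.classification import mnist

# Same MNIST model and dataset as in the default-metrics example
mnist_model = mnist.load_model()
train_ds = mnist.load_dataset(train=True, object_type='VisionData')

# Ignite metrics support arithmetic, so 2 * precision is itself a metric
precision = Precision(average=True)
double_precision = 2 * precision

check = SingleDatasetPerformance(scorers={'precision2': double_precision})
result = check.run(train_ds, mnist_model)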