Pytest

This tutorial shows how to use deepchecks inside unit tests on data or models with the pytest framework. We will use the diabetes dataset from scikit-learn and check whether selected columns have drifted between the training and test sets.

import pytest
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import FeatureDrift
from deepchecks.tabular.suites import data_integrity

Defining Pytest Fixtures

pytest fixtures provide a defined, reliable, and consistent context for tests. This can include the environment (for example, a database configured with known parameters) or content (such as a dataset). In this tutorial we define a fixture that loads the diabetes dataset from scikit-learn.

@pytest.fixture(scope='session')
def diabetes_df():
    diabetes = load_diabetes(return_X_y=False, as_frame=True).frame
    return diabetes
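
A session-scoped fixture is built once and the same object is handed to every test that requests it. The sketch below illustrates one way to keep that sharing safe when a test needs to modify the data: a function-scoped fixture returns a copy of the shared frame. The names `load_frame`, `shared_df`, and `private_df` are hypothetical, and the tiny hand-written frame stands in for the real dataset.

```python
import pandas as pd
import pytest


def load_frame():
    """Hypothetical stand-in for an expensive dataset load."""
    return pd.DataFrame({'age': [0.03, -0.01, 0.08],
                         'target': [151.0, 75.0, 141.0]})


@pytest.fixture(scope='session')
def shared_df():
    # Created once per test session and reused by every test.
    return load_frame()


@pytest.fixture
def private_df(shared_df):
    # Function scope (the default): each test receives its own copy.
    return shared_df.copy()


def test_can_modify_private_copy(private_df, shared_df):
    private_df['age'] = 0.0
    # The shared frame is untouched by the change above.
    assert (shared_df['age'] != 0.0).all()
```

Tests that only read the data can keep requesting the session-scoped fixture directly and avoid the cost of copying.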

Implementing the Test

Now we will implement a test that checks whether some columns have drifted between the train and test sets.

def test_diabetes_drift(diabetes_df):
    train_df, test_df = train_test_split(diabetes_df, test_size=0.33, random_state=42)
    train = Dataset(train_df, label='target', cat_features=['sex'])
    test = Dataset(test_df, label='target', cat_features=['sex'])

    check = FeatureDrift(columns=['age', 'sex', 'bmi'])
    check.add_condition_drift_score_less_than(max_allowed_categorical_score=0.2,
                                              max_allowed_numeric_score=0.1)

    result = check.run(train, test)

    assert result.passed_conditions()
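
To make the thresholds in the condition more concrete, here is a minimal sketch of one widely used drift statistic for categorical columns, the Population Stability Index (PSI). It is an illustration only and not deepchecks' implementation; the metric the check actually applies to each column depends on the installed deepchecks version and its settings. The function name `psi` and the `epsilon` floor are assumptions made for this example.

```python
import math
from collections import Counter


def psi(expected, actual, epsilon=1e-4):
    """Population Stability Index between two categorical samples.

    Sums (a - e) * ln(a / e) over categories, where e and a are the
    category frequencies in the expected and actual samples.
    """
    if not expected or not actual:
        raise ValueError('both samples must be non-empty')
    exp_counts = Counter(expected)
    act_counts = Counter(actual)
    score = 0.0
    for category in set(exp_counts) | set(act_counts):
        # Floor the frequencies so a missing category does not yield log(0).
        e = max(exp_counts[category] / len(expected), epsilon)
        a = max(act_counts[category] / len(actual), epsilon)
        score += (a - e) * math.log(a / e)
    return score


train_sex = ['m'] * 50 + ['f'] * 50
test_same = ['m'] * 25 + ['f'] * 25      # same proportions as train
test_shifted = ['m'] * 45 + ['f'] * 5    # proportions changed markedly

print(psi(train_sex, test_same))     # no shift, so the score is 0
print(psi(train_sex, test_shifted))  # well above a 0.2 threshold
```

A score of zero means the two samples have identical category proportions, and the score grows as the proportions diverge, which is why a condition compares it against an upper bound.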

Please note the passed_conditions() method of the deepchecks.core.CheckResult object. This method will return True if all the conditions are met, and False otherwise.

You can also run a suite of checks and obtain its overall result with the deepchecks.core.SuiteResult.passed() method.

def test_diabetes_integrity(diabetes_df):
    ds = Dataset(diabetes_df, label='target', cat_features=['sex'])

    suite = data_integrity()
    result = suite.run(ds)

    assert result.passed(fail_if_warning=True, fail_if_check_not_run=False)
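
The two flags passed to `passed()` control how strict the assertion is. The sketch below is one plausible reading of them, written as a small standalone model rather than deepchecks' own code: `CheckStatus` and `suite_passed` are hypothetical names, and the exact behavior should be confirmed against the deepchecks API reference.

```python
from enum import Enum


class CheckStatus(Enum):
    """Hypothetical outcome of a single check in a suite."""
    PASSED = 'passed'
    WARNING = 'warning'
    FAILED = 'failed'
    NOT_RUN = 'not_run'


def suite_passed(statuses, fail_if_warning=True, fail_if_check_not_run=False):
    """Return True only if no check outcome counts as a failure."""
    for status in statuses:
        if status is CheckStatus.FAILED:
            return False
        if status is CheckStatus.WARNING and fail_if_warning:
            return False
        if status is CheckStatus.NOT_RUN and fail_if_check_not_run:
            return False
    return True


outcomes = [CheckStatus.PASSED, CheckStatus.WARNING, CheckStatus.NOT_RUN]

# Strict about warnings, lenient about checks that did not run.
print(suite_passed(outcomes, fail_if_warning=True, fail_if_check_not_run=False))
# Lenient about both.
print(suite_passed(outcomes, fail_if_warning=False, fail_if_check_not_run=False))
```

Under this reading, the call in the test above treats a warning as a failure but tolerates checks that could not run on the given dataset.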
