Train-Test Validation Suite Quickstart#

The deepchecks train-test validation suite is relevant any time you wish to validate two data subsets. For example:

  • Comparing distributions across different train-test splits (e.g. before training a model or when splitting data for cross-validation)

  • Comparing a new data batch to previous data batches

Here we’ll use a loans’ dataset (deepchecks.tabular.datasets.classification.lending_club), to demonstrate how you can run the suite with only a few simple lines of code, and see which kind of insights it can find.

# Before we start, if you don't have deepchecks installed yet, run:
import sys
!{sys.executable} -m pip install deepchecks -U --quiet

# or install using pip from your python environment

Load Data and Prepare Data#

Load Data#

from deepchecks.tabular.datasets.classification import lending_club
import pandas as pd

data = lending_club.load_data(data_format='Dataframe', as_train_test=False)
data.head(2)
issue_d sub_grade term home_ownership fico_range_low total_acc pub_rec revol_util annual_inc int_rate dti purpose mort_acc loan_amnt application_type installment verification_status pub_rec_bankruptcies addr_state initial_list_status fico_range_high revol_bal id open_acc emp_length loan_status time_to_earliest_cr_line
0 2017-06-01 D1 36 months MORTGAGE 665.0 29.0 0.0 85.0 112600.0 17.09 22.71 debt_consolidation 4.0 24000.0 Individual 856.75 Verified 0.0 CO w 669.0 25779.0 110680237 13.0 2.0 0 794188.8
1 2017-06-01 C2 36 months RENT 670.0 14.0 0.0 34.8 35000.0 13.59 9.95 debt_consolidation 1.0 4200.0 Individual 142.72 Source Verified NaN FL f 674.0 3798.0 109936186 7.0 2.0 1 470793.6


Split Data to Train and Test#

# convert date column to datetime, `issue_d`` is date column
data['issue_d'] = pd.to_datetime(data['issue_d'])

# Use data from June and July for train and August for test:
train_df = data[data['issue_d'].dt.month.isin([6, 7])]
test_df = data[data['issue_d'].dt.month.isin([8])]

Run Deepchecks for Train Test Validation#

Define a Dataset Object#

Create a deepchecks Dataset, including the relevant metadata (label, date, index, etc.). Check out deepchecks.tabular.Dataset to see all of the columns and types that can be declared.

Define Lending Club Metadata#

categorical_features = ['addr_state', 'application_type', 'home_ownership', \
  'initial_list_status', 'purpose', 'term', 'verification_status', 'sub_grade']
index_name = 'id'
label = 'loan_status' # 0 is DEFAULT, 1 is OK
datetime_name = 'issue_d'

Create Dataset#

from deepchecks.tabular import Dataset

# Categorical features can be heuristically inferred, however we
# recommend to state them explicitly to avoid misclassification.

# Metadata attributes are optional. Some checks will run only if specific attributes are declared.

train_ds = Dataset(train_df, label=label,cat_features=categorical_features, \
                   index_name=index_name, datetime_name=datetime_name)
test_ds = Dataset(test_df, label=label,cat_features=categorical_features, \
                   index_name=index_name, datetime_name=datetime_name)
# for convenience lets save it in a dictionary so we can reuse them for future Dataset initializations
columns_metadata = {'cat_features' : categorical_features, 'index_name': index_name,
                    'label':label, 'datetime_name':datetime_name}

Run the Deepchecks Suite#

Validate your data with the deepchecks.tabular.suites.train_test_validation suite. It runs on two datasets, so you can use it to compare any two batches of data (e.g. train data, test data, a new batch of data that recently arrived)

Check out the when you should use for some more info about the existing suites and when to use them.

from deepchecks.tabular.suites import train_test_validation

validation_suite = train_test_validation()
suite_result = validation_suite.run(train_ds, test_ds)
# Note: the result can be saved as html using suite_result.save_as_html()
# or exported to json using suite_result.to_json()
suite_result
Train Test Validation Suite:
|            | 0/12 [Time: 00:00]
Train Test Validation Suite:
|█████       | 5/12 [Time: 00:00, Check=Date Train Test Leakage Duplicates]
Train Test Validation Suite:
|█████████   | 9/12 [Time: 00:02, Check=Feature Label Correlation Change]
Train Test Validation Suite:
|███████████ | 11/12 [Time: 00:04, Check=Label Drift]
Train Test Validation Suite:
|████████████| 12/12 [Time: 00:05, Check=Multivariate Drift]
Train Test Validation Suite


As you can see in the suite’s results: the Date Train-Test Leakage check failed, indicating that we may have a problem in the way we’ve split our data! We’ve mixed up data from two years, causing a leakage of future data in the training dataset. Let’s fix this.

Fix Data#

dt_col = data[datetime_name]
train_df = data[dt_col.dt.year.isin([2017]) & dt_col.dt.month.isin([6,7,8])]
test_df = data[dt_col.dt.year.isin([2018]) & dt_col.dt.month.isin([6,7,8])]
from deepchecks.tabular import Dataset

# Create the new Datasets
train_ds = Dataset(train_df, **columns_metadata)
test_ds = Dataset(test_df, **columns_metadata)

Re-run Validation Suite#

suite_result = validation_suite.run(train_ds, test_ds)
suite_result.show()
Train Test Validation Suite:
|            | 0/12 [Time: 00:00]
Train Test Validation Suite:
|█████       | 5/12 [Time: 00:00, Check=Date Train Test Leakage Duplicates]
Train Test Validation Suite:
|█████████   | 9/12 [Time: 00:02, Check=Feature Label Correlation Change]
Train Test Validation Suite:
|███████████ | 11/12 [Time: 00:04, Check=Label Drift]
Train Test Validation Suite:
|████████████| 12/12 [Time: 00:05, Check=Multivariate Drift]
Train Test Validation Suite


Ok, the date leakage doesn’t happen anymore!

However, in the current split after the fix, we can see that we have a multivariate drift, detected by the Multivariate Drift check. The drift is caused mainly by a combination of features representing the loan’s interest rate (int_rate) and its grade (sub_grade). In order to proceed, we should think about the two options we have: To split the data in a different manner, or to stay with the current split.

For working with different data splits: We can consider examining other sampling techniques (e.g. using only data from the same year), ideally achieving one in which the training data’s univariate and multivariate distribution is similar to the data on which the model will run (test / production data). Of course, we can use deepchecks to validate the new splits.

If the current split is representative and we are planning on training a model with it, it is worth understanding this drift (do we expect this kind of drift in the model’s production environment? can we do something about it?).

For more details about drift, see the Drift User Guide.

Run a Single Check#

We can run a single check on a dataset, and see the results.

# If we want to run only that check (possible with or without condition)
from deepchecks.tabular.checks import MultivariateDrift

check_with_condition = MultivariateDrift().add_condition_overall_drift_value_less_than(0.4)
# or just the check without the condition:
# check = MultivariateDrift()
dataset_drift_result = check_with_condition.run(train_ds, test_ds)

We can also inspect and use the result’s value:

dataset_drift_result.value
{'domain_classifier_auc': 0.7328426082404159, 'domain_classifier_drift_score': 0.46568521648083183, 'domain_classifier_feature_importance': {'int_rate': 0.5092728878422134, 'sub_grade': 0.38754783632616996, 'dti': 0.031056814836620684, 'revol_bal': 0.030762437444804377, 'initial_list_status': 0.02458051221666191, 'emp_length': 0.012952605239917654, 'pub_rec': 0.0038269060936120774, 'time_to_earliest_cr_line': 0.0, 'verification_status': 0.0, 'term': 0.0, 'purpose': 0.0, 'home_ownership': 0.0, 'application_type': 0.0, 'addr_state': 0.0, 'annual_inc': 0.0, 'revol_util': 0.0, 'open_acc': 0.0, 'total_acc': 0.0, 'fico_range_high': 0.0, 'pub_rec_bankruptcies': 0.0, 'installment': 0.0, 'loan_amnt': 0.0, 'mort_acc': 0.0, 'fico_range_low': -0.0}}

and see if the conditions have passed

dataset_drift_result.passed_conditions()
False

Create a Custom Suite#

To create our own suite, we can simply write all of the checks, and add optional conditions.

from deepchecks.tabular import Suite
from deepchecks.tabular.checks import FeatureDrift, MultivariateDrift, \
 PredictionDrift, LabelDrift

drift_suite = Suite('drift suite',
FeatureDrift().add_condition_drift_score_less_than(
  max_allowed_categorical_score=0.2, max_allowed_numeric_score=0.1),
MultivariateDrift().add_condition_overall_drift_value_less_than(0.4),
LabelDrift(),
PredictionDrift()
)

we can run our new suite using:

result = drift_suite.run(train_ds, test_ds)
result.show()
drift suite:
|     | 0/4 [Time: 00:00]
drift suite:
|█▎   | 1/4 [Time: 00:02, Check=Feature Drift]
drift suite:
|██▌  | 2/4 [Time: 00:02, Check=Multivariate Drift]
drift suite


Total running time of the script: (0 minutes 16.242 seconds)

Gallery generated by Sphinx-Gallery